A Novel Approach for Practical, Real-Time, Machine Learning Based
IP Traffic Classification
Dissertation submitted in accordance with the requirements for the degree of Doctor of Philosophy
Thuy T.T. Nguyen
Centre for Advanced Internet Architectures
Faculty of Information and Communication Technologies
Swinburne University of Technology
Melbourne, Australia
February 2009
Declaration
To the best of my knowledge and belief, this thesis contains no material previously published or
written by any other person, except where due reference is made in the text of the thesis. This
thesis has not been submitted previously, in whole or in part, to qualify for any other academic
award. The content of the thesis is the result of work which has been carried out since the
beginning of my candidature in March 2003.
Melbourne, 23rd February 2009
Thuy Nguyen
© Thuy Nguyen 2009. All rights reserved.
To my beloved family, especially
my husband Uy Dzung and my little daughter Khiet Linh!
Acknowledgements
Working towards this PhD was a long and challenging journey, and I would like to thank the
following people for making it possible.
First of all, I would like to express my sincere gratitude and appreciation to my first super-
visor, Professor Grenville Armitage. Throughout the years he has been a great mentor to me. I
have experienced both successful and frustrating experimental outcomes, sometimes losing my
way, and it was his guidance, support, encouragement, enthusiasm and passion that helped me
stay inspired and motivated. I feel so grateful to have had a supervisor who is willing to stand
up for his students; who tries hard to provide a great working environment with all the neces-
sary facilities and equipment for our experiments and research; and who creates opportunities
for us to present our work and establish networking connections at both local and international
conferences and workshops. Finally, I would like to thank him for all his patience during many
long hours of discussions and experiments, for teaching me how to do good research, how to
write a good paper, etc. All of these have really built a solid grounding for my future research
career.
I would like to thank Dr Philip Branch, my colleague and recently my second supervisor,
who has always been willing to provide me with help, support and advice when needed. I
am thankful for the encouragement he has given me since the early days of starting my thesis.
I really appreciate all the time he spent helping me review my work and offering valuable
suggestions and feedback. I would also like to thank Dr Jim Lambert (who was my second
supervisor for the first two years of my candidature and is now retired) for his support of my
work.
I owe special thanks to my colleagues, Sebastian Zander and Nigel Williams, for the inspiration
of their work, which ultimately led me to my thesis topic. My thanks to them for always being so
helpful and kind to me over the years. I deeply thank Warren Harrop and Lawrence Stewart, for
their kindness and generosity in giving me the VoIP data trace collected at their home network
to support my research. I would also like to thank Dragi Klimovski for giving me the opportu-
nity to attend and study his Cisco CCNP class – to widen my knowledge and gain experience
which benefited my research. To my other colleagues at the Centre for Advanced Internet Ar-
chitectures, I must say that I have been so lucky to have a chance to work in a great research
environment, with such smart, helpful and nice people – thank you all!
I would like to thank the Swinburne IT Services Department and the Centre for Astrophysics
and Supercomputing for providing the laboratory equipment that facilitated my research.
I would also like to thank the Centre for Advanced Internet Architectures, Cisco Systems
Australia, and Swinburne University of Technology for awarding me the Swinburne University
Postgraduate Research Award (SUPRA) and for providing funding support for the duration of
my candidature.
I would like to thank my husband – Uy Dzung – for walking with me on this journey with
his infinite support, love, and encouragement. Thanks to my parents, my sisters and my parents-
in-law for their patience and understanding. Completing this PhD would not have been possible
without the friendship of many special people. This is a sincere thank you to all of them. Last
but not least, this thesis is specially dedicated to Khiet Linh, my dear little daughter, who has
been separated from mommy for many months so that I could complete my thesis!
Contents
Acknowledgements 6
Abstract 14
Publications 17
Table of Acronyms 19
1 Introduction 21
2 Application Context for ML Based IPTC 28
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.2 The importance of IP traffic classification . . . . . . . . . . . . . . . . . . . . 30
2.2.1 QoS issues over Last Mile networks . . . . . . . . . . . . . . . . . . . 30
2.2.2 QoS provisioning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
Internet QoS standards . . . . . . . . . . . . . . . . . . . . . . . . . . 32
QoS-enabled solutions from industry . . . . . . . . . . . . . . . . . . 32
Automated QoS solution . . . . . . . . . . . . . . . . . . . . . . . . . 33
The role of IP traffic classification . . . . . . . . . . . . . . . . . . . . 33
2.2.3 Internet pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.2.4 Lawful interception . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3 Traffic classification metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.1 Positives, negatives, accuracy, precision and recall . . . . . . . . . . . 37
2.3.2 Byte and flow accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . 39
2.4 Limitations of packet inspection for traffic classification . . . . . . . . . . . . . 40
2.4.1 Port-based IP traffic classification . . . . . . . . . . . . . . . . . . . . 40
2.4.2 Payload-based IP traffic classification . . . . . . . . . . . . . . . . . . 41
2.5 Classification based on statistical traffic properties . . . . . . . . . . . . . . . . 42
2.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3 A Brief Background on Machine Learning 44
3.1 A review of classification with Machine Learning . . . . . . . . . . . . . . . . 44
3.1.1 Input and output of an ML process . . . . . . . . . . . . . . . . . . . . 45
3.1.2 Different types of learning . . . . . . . . . . . . . . . . . . . . . . . . 46
3.1.3 Supervised learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
The Naive Bayes algorithm . . . . . . . . . . . . . . . . . . . . . . . . 47
The C4.5 Decision Tree algorithm . . . . . . . . . . . . . . . . . . . . 49
3.1.4 Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
EM algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
3.1.5 Evaluating supervised learning algorithms . . . . . . . . . . . . . . . . 52
3.1.6 Evaluating unsupervised learning algorithms . . . . . . . . . . . . . . 54
3.1.7 Feature selection algorithms . . . . . . . . . . . . . . . . . . . . . . . 55
3.1.8 Imbalanced datasets problem . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 The application of ML in IP traffic classification . . . . . . . . . . . . . . . . . 57
3.2.1 Training and testing a supervised ML traffic classifier . . . . . . . . . . 59
3.2.2 Supervised versus unsupervised learning . . . . . . . . . . . . . . . . 62
3.3 Challenges for operational deployment . . . . . . . . . . . . . . . . . . . . . . 63
3.3.1 A deployment scenario . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.3.2 The operational challenges . . . . . . . . . . . . . . . . . . . . . . . . 66
Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Timely and continuous classification . . . . . . . . . . . . . . . . . . . 66
Directional neutrality . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Efficient use of memory and processors . . . . . . . . . . . . . . . . . 67
Portability and Robustness . . . . . . . . . . . . . . . . . . . . . . . . 68
3.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4 IP Traffic Classification Using Machine Learning 70
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.2 Clustering approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.2.1 Flow clustering using Expectation Maximisation . . . . . . . . . . . . 71
4.2.2 Automated application identification using AutoClass . . . . . . . . . 72
4.2.3 TCP-based application identification using Simple K-Means . . . . . . 73
4.2.4 Identifying HTTP and P2P traffic in the network core . . . . . . . . . . 75
4.3 Supervised learning approaches . . . . . . . . . . . . . . . . . . . . . . . . . 76
4.3.1 Statistical signature-based approach using NN, LDA and QDA algorithms 76
4.3.2 Classification using Bayesian analysis techniques . . . . . . . . . . . . 77
4.3.3 GA-based classification techniques . . . . . . . . . . . . . . . . . . . 78
4.3.4 Simple statistical protocol fingerprint method . . . . . . . . . . . . . . 79
4.4 Hybrid approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
4.5 Comparisons and related work . . . . . . . . . . . . . . . . . . . . . . . . . . 81
4.5.1 Comparison of different clustering algorithms . . . . . . . . . . . . . . 81
4.5.2 Comparison of clustering versus supervised techniques . . . . . . . . . 82
4.5.3 Comparison of different supervised ML algorithms . . . . . . . . . . . 83
4.5.4 ACAS: Classification using machine learning techniques on application
signatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 84
4.5.5 BLINC: Multilevel traffic classification in the dark . . . . . . . . . . . 85
4.5.6 Pearson’s Chi-Square test and Naive Bayes classifier . . . . . . . . . . 86
4.6 Limitations of the reviewed works . . . . . . . . . . . . . . . . . . . . . . . . 87
4.6.1 Timely and continuous classification . . . . . . . . . . . . . . . . . . . 87
4.6.2 Directional neutrality . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4.6.3 Efficient use of memory and processors . . . . . . . . . . . . . . . . . 88
4.6.4 Portability and Robustness . . . . . . . . . . . . . . . . . . . . . . . . 88
4.7 My research goal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.8 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5 Training Using Multiple Sub-Flows for Real-Time IPTC 90
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.2 My proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
5.3 My experimental approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3.1 Flows and features . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.3.2 Machine Learning algorithms . . . . . . . . . . . . . . . . . . . . . . 94
5.3.3 Some statistical properties of ET traffic . . . . . . . . . . . . . . . . . 95
5.3.4 Constructing training and testing datasets . . . . . . . . . . . . . . . . 100
ET traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
Other traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
Training with full-flow, testing with four different sliding windows . . . 102
Training with full-flow instances of more than 25 packets (called filtered
full-flow), testing with a sliding window of N = 25 packets . . 103
Training with individual sub-flow, testing with a sliding window of N =
25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
Training with multiple sub-flows, testing with a sliding window of N =
25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . 103
5.3.5 Data processing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
5.4 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.4.1 Training with full-flows, testing with four different sliding windows . . 109
5.4.2 Training with filtered full-flows, testing with a sliding window of N =
25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
5.4.3 Training with individual sub-flows, testing with a sliding window of N
= 25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
5.4.4 Training with multiple sub-flows, testing with a sliding window of N =
25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
6 Clustering For Automated Sub-Flow Selection 128
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
6.2 My proposal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
6.2.1 Step 1 - Sub-flow identification . . . . . . . . . . . . . . . . . . . . . 130
6.2.2 Step 2 - Sub-flows selection . . . . . . . . . . . . . . . . . . . . . . . 131
6.3 An experimental illustration of my proposal . . . . . . . . . . . . . . . . . . . 133
6.3.1 Step 1 - Sub-flow identification . . . . . . . . . . . . . . . . . . . . . 133
6.3.2 Step 2 - Sub-flow selection . . . . . . . . . . . . . . . . . . . . . . . . 134
6.3.3 Evaluation of classifiers trained with sub-flows selected by EM . . . . . 137
6.4 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
6.4.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
6.4.2 Computational performance . . . . . . . . . . . . . . . . . . . . . . . 144
6.4.3 Summary of results . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
6.5 Sampling for faster clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.5.1 The problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
6.5.2 Down-sampling for the clustering proposal . . . . . . . . . . . . . . . 152
6.5.3 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
6.6 Discussion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 158
6.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
7 Training Using Synthetic Sub-Flow Pairs 162
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
7.2 Proposal using a synthetic sub-flow pairs approach . . . . . . . . . . . . . . . 163
7.3 Illustrating the Synthetic Sub-Flow Pairs Training Approach . . . . . . . . . . 168
7.3.1 Experimental data . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
7.3.2 Test methodology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.4 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
7.4.1 Classifying without training on SSP . . . . . . . . . . . . . . . . . . . 170
7.4.2 Training on SSP Option 1, classifying with a sliding window . . . . . . 172
7.4.3 Training on SSP Option 2, classifying with a sliding window . . . . . . 176
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
8 Training Using SSP-ACT 183
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
8.2 Evaluation of SSP-ACT in identifying VoIP traffic . . . . . . . . . . . . . . . . 184
8.2.1 A brief background on ITU-T G.711 PCMU and GSM 06.10 encoded
voice traffic . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 184
8.2.2 Data collection and research methodology . . . . . . . . . . . . . . . . 185
Statistical properties of VoIP flows . . . . . . . . . . . . . . . . . . . . 186
8.2.3 Results and analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
8.3 Evaluation of SSP-ACT in the presence of additional packet loss . . . . . . . . 194
8.3.1 Impact of packet loss on the classification of ET traffic . . . . . . . . . 198
8.3.2 Impact of packet loss on the classification of VoIP traffic . . . . . . . . 200
8.4 Concurrent classification of multiple applications with SSP-ACT . . . . . . . . 204
8.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 212
8.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
9 Conclusion 214
Bibliography 218
List of Figures 240
List of Tables 240
A Traffic Characteristics of Selected Internet Applications 251
A.1 Asymmetric properties in bi-directional communication . . . . . . . . . . . . . 251
A.1.1 Client server ports asymmetry . . . . . . . . . . . . . . . . . . . . . . 251
A.1.2 Statistical Properties Asymmetry . . . . . . . . . . . . . . . . . . . . 252
A.2 Variation of traffic statistics during flow lifetime . . . . . . . . . . . . . . . . . 254
B Summary of ML-Based IP TC works in the Literature 258
B.1 A summary of key points for each reviewed work . . . . . . . . . . . . . . . . 258
B.2 A qualitative evaluation of the reviewed works . . . . . . . . . . . . . . . . . . 258
C Some Properties of Data Used for Training and Testing 265
C.1 Geographical distribution of ET traffic . . . . . . . . . . . . . . . . . . . . . . 265
C.2 Traffic mix for training and testing . . . . . . . . . . . . . . . . . . . . . . . . 266
D Characteristics of VoIP Traffic 271
D.1 VoIP data extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 271
D.2 Statistical properties of G.711 and GSM flows . . . . . . . . . . . . . . . . . . 272
E Trade-offs in Cluster Quality and Classifier Performance 274
E.1 Accuracy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 275
E.2 Computational performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 277
Abstract
Today’s Internet does not guarantee any bounds on packet delay, loss or jitter for traffic travers-
ing its networks. Uncontrolled networks can easily lead to bad user experiences for those emerg-
ing applications that have more stringent Quality of Service (QoS) requirements. This suggests
there is a vital need for an effective QoS-enabled network architecture, in which the network
equipment is capable of classifying Internet traffic into different classes for different QoS treat-
ments. Beyond technology, there are other issues related to a practical QoS solution for the
Internet, including the challenges of minimising the deployment cost of QoS technologies and
simplifying users’ experiences. Like other services, the Internet is expected to be user-friendly,
simple and easy to understand, stable and available on request, predictable and transparent, and
not requiring users to understand its underlying architecture in order to use the service.
With an awareness of these issues, my thesis focuses on the automation of the QoS control
process, particularly by means of an automated, real-time IP traffic classification (IPTC) mech-
anism. Traditional techniques for the identification of Internet applications are based either on
the use of well-known registered port numbers or on payload-based protocol reconstruction.
However, applications can use unregistered ports or encryption to obfuscate packet contents;
and governments may impose privacy regulations that constrain the ability of third parties to
lawfully inspect packet payloads. Newer approaches, on the other hand, classify traffic by learn-
ing and recognising statistical patterns in externally observable attributes of the traffic (such as
packet lengths and inter-packet arrival times). State-of-the-art approaches look closely at the
application of Machine Learning (ML) – a powerful technique for data mining and knowledge
discovery – to the classification of IP traffic.
However, before I began publishing my work no ML-based approach to IPTC properly con-
sidered the constraints of being deployed in real-time operational networks. Most publications
on the use of ML algorithms for classifying IP traffic have relied on bi-directional, full-flow
statistics (from start until finish or time-out), while assuming that flows have an explicit direc-
tion implied by the first packet captured, or a known client-server relationship. Some other
studies have tried classification using the first few packets of a flow. In contrast, most if not
all real-world scenarios require a classification decision well before a flow has finished, using
statistics derived from a small number of recent packets rather than from the entire flow. Clas-
sifiers may also have missed an arbitrary number of packets from the start of a flow, and be
unsure of the direction in which the flow started.
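One way to meet this real-time constraint is to maintain per-flow statistics over a sliding window of only the most recent N packets. The sketch below (in Python, with hypothetical class and feature names; it is an illustration of the idea, not the implementation used in this thesis) shows how such window statistics might be computed:

```python
from collections import deque
from statistics import mean, stdev

class SlidingWindowFeatures:
    """Maintain statistics over the most recent N packets of a flow."""

    def __init__(self, n=25):
        self.n = n
        self.packets = deque(maxlen=n)  # holds (timestamp, length) pairs

    def add(self, timestamp, length):
        """Record a newly observed packet; the oldest is evicted automatically."""
        self.packets.append((timestamp, length))

    def ready(self):
        """True once the window contains N packets."""
        return len(self.packets) == self.n

    def features(self):
        """Return per-window statistics usable as input to an ML classifier."""
        times = [t for t, _ in self.packets]
        lengths = [l for _, l in self.packets]
        iats = [b - a for a, b in zip(times, times[1:])]  # inter-arrival times
        return {
            "mean_len": mean(lengths),
            "stdev_len": stdev(lengths),
            "mean_iat": mean(iats),
            "stdev_iat": stdev(iats),
        }
```

Because the window slides forward with every packet, a classification decision can be made at any point in the flow's lifetime, regardless of how many earlier packets were missed.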
To overcome these problems, I propose and evaluate novel modifications to the current ML-
based approaches. My goal is to achieve classification by using statistics derived from only the
most recent N packets of a flow (for some small value of N). Because a target application’s
short-term traffic statistics vary within the lifetime of a single flow, I propose training the ML
classifier on a set of multiple short sub-flows, each ‘sub-flow’ being a collection of N consec-
utive packets extracted from full-flow samples of the target application’s traffic. The sub-flows
are picked from regions of the application’s flow that have noticeably different statistical char-
acteristics. I further augment the training set by synthesising a complementary version of every
sub-flow in the reverse direction, since most Internet applications exhibit asymmetric traffic
characteristics in the client-to-server and server-to-client directions. Finally, I propose a novel
use of unsupervised ML algorithms for the automated selection of appropriate sub-flow pairs
when examples of traffic are given from applications that we wish to classify.
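The sub-flow and synthetic-pair ideas above can be sketched as follows. The helper names and the (timestamp, length, direction) packet representation are hypothetical, chosen to illustrate the concept rather than reproduce the exact training procedure:

```python
def extract_subflow(flow, start, n=25):
    """Take N consecutive packets beginning at an arbitrary offset in the flow."""
    return flow[start:start + n]

def mirror_direction(subflow):
    """Synthesise the complementary sub-flow as seen from the reverse direction
    by flipping each packet's direction flag (client-to-server <-> server-to-client)."""
    return [(ts, length, "s2c" if d == "c2s" else "c2s")
            for ts, length, d in subflow]

def build_training_set(flow, offsets, n=25):
    """Train on multiple sub-flows drawn from statistically distinct regions of
    the flow, plus a synthetic reverse-direction pair for each one."""
    training = []
    for start in offsets:
        sub = extract_subflow(flow, start, n)
        if len(sub) == n:
            training.append(sub)
            training.append(mirror_direction(sub))  # the synthetic pair
    return training
```

Training on both members of each pair means the classifier no longer depends on knowing which direction a flow started in, which matters when the first captured packet may be from either endpoint.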
I combine my proposals into a training approach that I call Synthetic Sub-flow Pairs with the
Assistance of Clustering Techniques (SSP-ACT). I demonstrate my optimisation when applied
to the Naive Bayes and C4.5 Decision Tree ML algorithms, for the identification of an online
game – Wolfenstein Enemy Territory (ET) – and of VoIP traffic. My experiments showed that for
ET, when trained using SSP-ACT and classifying using a small sliding classification window
of 25 packets (roughly corresponding to 0.5 seconds in real time), the Naive Bayes classifier
achieved 98.9% median Recall and 87% median Precision, and the C4.5 Decision Tree classifier
achieved 99.3% median Recall and 97% median Precision. My results also confirmed that
classification performance is maintained even when the classification is initiated at an arbitrary
point within a flow and is independent of the direction of the first packet captured.
For VoIP, when trained using SSP-ACT and classifying on a sliding window of 25 packets
(approximately 0.25 seconds in real-time when there is voice traffic in both directions), the
Naive Bayes classifier achieved 100% median Recall and 95.4% median Precision, and the
C4.5 Decision Tree classifier achieved 95.7% median Recall and 99.2% median Precision.
I also study the impact of packet loss on SSP-ACT’s performance, with 5% synthetic, ran-
dom and independent packet loss. For Wolfenstein Enemy Territory traffic, 5% packet loss only
degraded the Recall and Precision of both the Naive Bayes and C4.5 Decision Tree classifiers
by less than 0.5%. For VoIP traffic, 5% packet loss caused no noticeable degradation in
the Naive Bayes classifier's Recall and Precision. However, it degraded the C4.5 Decision
Tree classifier’s Recall and Precision by 8.5% and 0.1% respectively. Despite this degradation,
median Recall and Precision of the C4.5 Decision Tree classifier still remained above 87% and
99% for all the tested positions of the sliding window. Deeper investigation of the sensitivity of
the Naive Bayes and C4.5 Decision Tree classifiers with regards to packet loss is left for future
research. This work also can be expanded in future with other loss rates and loss models.
I also demonstrate that SSP-ACT is effective in identifying both ET and VoIP traffic con-
currently, by using a single common classifier or two separate classifiers in parallel, one for
each application. My results reveal that using a common classifier provides better Precision and
Recall, at the cost of classification speed, and each option carries further trade-offs relative
to the other. How SSP-ACT could scale to classify a
larger number of applications simultaneously is a question that requires further study.
My results show that SSP-ACT is a significant improvement over the previous, published
state of the art for IP traffic classification. My present work has focused on IPTC of an online
game and VoIP, and revealed a potential solution to the accurate and timely classification of
traffic belonging to other Internet applications.
Publications
A number of peer-reviewed papers have been published based on material and discussion in this
thesis, as listed below.
Peer-reviewed Journal Papers:
• T.T.T. Nguyen and G. Armitage, “A Survey of Techniques for Internet Traffic Classifica-
tion using Machine Learning,” IEEE Communications Surveys & Tutorials, Vol. 10, No.
4, 2008
• J. But, T.T.T. Nguyen, G. Armitage, “The Brave New World of Online Digital Home
Entertainment,” IEEE Communications, May 2005
• T.T.T. Nguyen, G. Armitage, “Evaluating Internet Pricing Schemes - A Three Dimen-
sional Visual Model,” ETRI Journal, Vol. 27, No. 1, February 2005.
Peer-reviewed Conference Papers:
• T.T.T. Nguyen and G. Armitage, “Clustering to Assist Supervised Machine Learning for
Real-Time IP Traffic Classification,” in Proc. 2008 IEEE International Conference on
Communications, pp. 5857-5862. Beijing, China, 19-23 May 2008.
• T.T.T. Nguyen, G. Armitage, “Synthetic Sub-flow Pairs for Timely and Stable IP Traffic
Identification,” in Proc. Australian Telecommunication Networks and Applications Con-
ference, Melbourne, Australia, December 2006.
• T.T.T. Nguyen, G. Armitage, “Training on Multiple Sub-flows to Optimise the Use of
Machine Learning Classifiers in Real-world IP Networks,” in IEEE 31st Conference on
Local Computer Networks, Tampa, Florida, USA, November 2006.
• S. Zander, T.T.T. Nguyen, G. Armitage, “Automated Traffic Classification and Appli-
cation Identification using Machine Learning,” Proc. IEEE 30th Conference on Local
Computer Networks (LCN 2005), Sydney, Australia, November 2005
• S. Zander, T.T.T. Nguyen, G. Armitage, “Self-learning IP Traffic Classification based
on Statistical Flow Characteristics,” Passive Active Measurement Workshop (PAM) 2005,
Boston, USA, March/April 2005.
• T.T.T. Nguyen, G. Armitage, “Experimentally Derived Interactions between TCP Traffic
and Service Quality over DOCSIS Cable Links,” Proc. of Global Internet and Next
Generation Networks Symposium, IEEE Globecom 2004, Texas, USA, November 2004.
• T.T.T. Nguyen, G. Armitage, “Quantitative Assessment of IP Service Quality in 802.11b
Networks,” The 3rd Workshop on the Internet, Telecommunications and Signal Process-
ing (WITSP’04), Adelaide, Australia, December 2004.
• T.T.T. Nguyen, G. Armitage, “Quantitative Assessment of IP Service Quality in 802.11b
and DOCSIS networks,” The Australian Telecommunication Networks and Applications
Conference (ATNAC 2004), Sydney, Australia, December 2004.
• T.T.T. Nguyen, G. Armitage, “Pricing the Internet - A Visual 3-Dimensional Evaluation
Model,” Proc. of Australian Telecommunications Networks and Applications Conference
(ATNAC), Melbourne, Australia, December 2003
Table of Acronyms
CM Cable Modem
DBSCAN Density Based Spatial Clustering of Applications with Noise
DiffServ Differentiated Services
DNS Domain Name System
DOCSIS Data Over Cable Service Interface Specifications
DS Downstream
ET Wolfenstein Enemy Territory
FPS First Person Shooter
FTP File Transfer Protocol
HTTP HyperText Transfer Protocol
IMAP Internet Message Access Protocol
IntServ Integrated Services
IP Internet Protocol
IPTC IP Traffic Classification
ISP Internet Services Provider
Kbyte Kilobyte (equal to 1024 bytes)
LI Lawful Interception
Mbps Megabits per second
Mbyte Megabyte (equal to 1024 Kbytes)
ML Machine Learning
MPLS Multi Protocol Label Switching
NTP Network Time Protocol
QoS Quality of Service
RTT Round Trip Time
SMTP Simple Mail Transfer Protocol
SSP-ACT Synthetic Sub-Flow Pairs with the Assistance of Clustering Techniques
US Upstream
HFC Hybrid Fibre Coaxial Network
CMTS Cable Modem Termination System
ACK Acknowledgement
P2P Peer-to-Peer
ICMP Internet Control Message Protocol
MTU Maximum Transmission Unit
FN False Negatives
FP False Positives
TP True Positives
TN True Negatives
MSS Maximum Segment Size
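The True/False Positive and Negative counts listed above combine into the Recall and Precision metrics reported throughout this thesis. A minimal sketch of the standard definitions:

```python
def recall(tp, fn):
    """Fraction of the target application's actual flows that were correctly
    identified: TP / (TP + FN)."""
    return tp / (tp + fn)

def precision(tp, fp):
    """Fraction of flows labelled as the target application that truly belong
    to it: TP / (TP + FP)."""
    return tp / (tp + fp)
```

A classifier can achieve high Recall while still mislabelling much other traffic (low Precision), which is why the thesis reports both metrics together.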
Chapter 1
Introduction
The Internet has now been a part of our lives for more than 30 years, since the first public demon-
stration of the ARPANET network technology in 1972 [1]. Its use has been growing rapidly over
the years with increases not only in the number of users [2][3], hosts and servers [4], networks
and autonomous systems [5], but also in volume and types of traffic [6]. Traditional Internet
applications, such as electronic mail, file transfer, and static-content web sites, are being joined
by newer services such as IP telephony, real-time interactive audio and video conferencing,
streaming of multimedia content, online games, and electronic commerce. This creates a wide
range of household [7][8][9] and business Internet uses [10][11][12][13][14][15][16]. This
expanding trend is driven further by the rapid development of computing and communications
in portable forms (e.g. laptop computers, PDAs and cellular phones), along with new modes of
Internet access (e.g. from dial-up to broadband to possible optical access networks in the future
[17][18]), which will potentially spawn more new applications and services.
With these developing trends, parameters such as the timeliness of data delivery, packet loss
and variability in end-to-end packet delay (jitter) become more important for Internet quality of
service (QoS). Traditional non-interactive applications, such as bulk data transfer (FTP), backup
operations or database synchronising, can span their operations over a long period of non-peak
time as background activities [19]. On the other hand, emerging interactive applications such as
business transactions and Web surfing are delay-sensitive; waiting times for these are tolerable
only on the order of seconds [20]. Even less tolerant of delay are those applications that need
to satisfy human requirements for interactivity, such as real-time voice communication and
networked online games. The delay limits for these application types are a fraction of a second
[21][22][23]. Similarly, video performance can suffer from jerky appearance due to jitter and
frame distortion resulting from packet loss [24]. For voice applications, the loss of two or
more consecutive voice samples may result in noticeable degradation of voice quality [19]. In
various studies, online game applications have also been shown to be sensitive to network delay,
loss and jitter [25][26][27][28][29].
Finding viable solutions for QoS-enabled Internet has attracted considerable research effort
since the early 1990s, with the introduction of the Integrated Services (IntServ) [30], Differen-
tiated Services (DiffServ) [31], and Multi Protocol Label Switching (MPLS) [32] architectures.
However, the introduction of these architectures has yet to make a significant impact on the QoS
perceived by Internet end users. Most networks and applications are still dominated by ‘Best-
Effort’ services, in which the network provides no guarantee on the bounds of packet delay, loss
or jitter.
One reason for the poor uptake of these QoS approaches is the lack of an effective mecha-
nism that allows applications to signal their explicit QoS requirements to the underlying network
[33]. One option is to leave this task to the applications or to the users. However, it might be
unreasonable to expect software developers to be aware of the network issues or to understand
the underlying technologies and explicit network requirements for providing QoS for their ap-
plications. Furthermore, tying an application to a particular standard for QoS provisioning, or
requiring complicated user intervention or knowledge may restrict its options for deployment
[33][34]. An alternative solution is to shift QoS signalling from the application to the network
[33][35]. In this approach, the network is equipped with intelligent devices that can automat-
ically classify traffic in terms of QoS demands, and prompt the ISP’s QoS control system to
provide appropriate QoS treatment.
There are also other issues beyond those related to technology to be faced in order to achieve
a successful Internet QoS solution [36]. These are the challenges of minimising the deployment
cost of QoS technologies and simplifying users’ experiences. For ISPs, implementation and
operational costs must not exceed the revenues likely to be gained by deploying any new QoS
scheme. ISPs may also resist deploying a complex technology if there are questions as to its
reliability and operational effort [37][38]. For Internet users, the Internet is expected to be user-
friendly, simple to understand, stable and available on request, predictable and transparent, and
should not require that users understand the underlying architecture in order to use the service
[20][37][39][40].
The work of this thesis is motivated by the desire to find a good solution for Internet QoS.
My literature review on the QoS problem space suggests that a network based, robust and auto-
mated real-time IP traffic classification technique is an important component for implementing
QoS across the Internet. IP traffic classification (which I will refer to as IPTC) is the process
of identifying and classifying an individual Internet application or a group of applications of
interest. It can serve as a core part of an automated QoS-enabled architecture, assist the QoS
signalling process by quickly identifying the traffic of interest, and trigger an automated QoS
control system for allocation of network resources for priority applications. Real-time IPTC
allows network operators to know in good time what is flowing over their networks, so they
can react quickly in support of their various business goals. It also has the potential to support
class-based QoS accounting and billing. More importantly, it can be done automatically by the
network providers, and does not require users’ intervention or specialist knowledge about the
underlying technologies. It can help to bring QoS to consumers in a user friendly way. Further-
more, IPTC can assist in automated intrusion detection [41][42]. Recently, governments have
also been clarifying ISP obligations with respect to ‘lawful interception’ (LI) of IP traffic [43].
IPTC is an integral part of ISP-based LI solutions [44][45][46].
Traditional techniques for identifying Internet applications are typically based on the use
of well-known registered port numbers or on payload-based protocol reconstruction. How-
ever, applications may use unregistered ports or encryption to obfuscate packet contents and
governments may impose privacy regulations constraining the ability of third parties to law-
fully inspect packet payloads. Newer approaches classify traffic by learning and recognising
statistical patterns in externally observable attributes of the traffic (such as packet lengths and
inter-packet arrival times). In particular, state-of-the-art techniques include the application of
Machine Learning (ML) – a powerful technique for data mining and knowledge discovery – to
IPTC.
However, the literature of ML-based IPTC has not properly considered the constraints of be-
ing deployed in real-time operational networks. Most published work has primarily focused on
the efficacy of different ML algorithms when applied to entire datasets of IP traffic. Classification models typically rely on flow statistical properties measured over full-flows (from their start
until they finish or are timed out); some more recent work has attempted classification using the
first few packets of a flow. Yet in real networks, traffic classifiers must reach decisions well
before a flow has finished, so that network operators can react quickly to support their various
business goals, for example, for flow QoS mapping and priority treatment. The classifier may
start (or restart) at an arbitrary time and may not see the beginning of a flow. An application’s
statistical behaviour may change over its flow lifetime; in addition there may be thousands of
concurrent flows, and the classifier needs to operate with finite CPU and memory resources.
Further, although this has not always been clearly stated in the literature, directionality
has been an implicit attribute of the features on which ML classifiers were built and used.
Application flows in many cases are defined as bi-directional, and the application’s statistical
features are calculated separately in the forward and backward (reverse) directions. Most work
assumes that the forward direction is indicated by the first packet of the flow (on the basis that it
is commonly the initial packet from a client to a server). Subsequent evaluations assume that the
classifier sees the first packet of every flow, in order to calculate features with the correct sense
of direction. However, a real-world classifier cannot be sure whether the first packet it sees
(of any bi-directional flow of packets) is heading in the forward (client-to-server) or backward
(server-to-client) direction. Because the traffic of many Internet applications is asymmetric in the client-to-server and server-to-client directions, this ambiguity can lead to degraded classification performance.
In contrast to previously published work, I consider not only the timeliness of an ML traffic classifier, but also the sustainability of its performance when monitoring traffic flows at any point in
their lifetime, given the constraints of limited physical resources. This makes the contribution
of my work novel and unique.
I propose that practical real-time traffic classifiers must accurately classify traffic in the face
of a number of constraints:
• The classifier should use statistical methods (such as ML algorithms), since TCP/UDP
port numbers may be misleading, and packet payloads may be opaque to direct interpre-
tation.
• ML classification should be done over a small sliding window of the last N packets (to
minimise memory requirements and perform classification in a timely manner).
• The classifier must use only features that have a low processing/computation cost.
• The classifier must cope with applications changing their network traffic patterns during the life of a flow.
• The classifier must recognise flows already in progress, since the beginning of a flow may be missed.
• The classifier must not rely on knowing the direction of the original flow. It can assume the forward direction is the direction of the first packet of the most recent N packets it has captured, regardless of whether this is from client to server or server to client.
My research question, therefore, is to investigate the possibility of building practical ML-
based real-time traffic classifiers that address all of the above requirements.
In this thesis I propose a novel approach to ML-based IPTC that I call the ‘Synthetic Sub-
Flow Pairs with the Assistance of Clustering Techniques’ (SSP-ACT) training method. Instead
of using the statistical properties of a flow calculated over its whole lifetime, or from its first few
packets, I train the ML classifier on a set of short sub-flows (each sub-flow contains a number of
consecutive packets extracted from full-flow examples of the target application’s traffic). This
allows the classifier to properly identify an application, regardless of where within a flow the
classifier begins capturing packets.
Dealing with the directionality issues, SSP-ACT further augments the training set by synthe-
sising a complementary version of every sub-flow in the reverse direction (hence the ‘synthetic
sub-flow pairs’ term). The first packet of a sliding window can alternatively represent traffic
between a client to a server or a server to a client. SSP-ACT trains the classifier to recognise
the application either way.
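The two training-set augmentations just described can be sketched as follows. This is my own toy representation (a flow as a list of (direction, length) pairs, with illustrative parameter values), not the thesis's code: first slice short sub-flows of N consecutive packets out of a full flow at several offsets, then synthesise a direction-reversed twin of each sub-flow so the classifier learns both orientations.

```python
# Sketch of sub-flow extraction plus synthetic sub-flow pair generation.
N = 6  # sub-flow length in packets (illustrative)

def sub_flows(full_flow, offsets):
    """Extract N-packet sub-flows starting at the given packet offsets."""
    return [full_flow[o:o + N] for o in offsets if o + N <= len(full_flow)]

def reverse_pair(sub_flow):
    """Synthetic twin: same packets, with every direction flag flipped."""
    flip = {"fwd": "bwd", "bwd": "fwd"}
    return [(flip[d], length) for d, length in sub_flow]

# A toy full flow: (direction, packet length) for 20 packets.
flow = [("fwd" if i % 3 else "bwd", 60 + i) for i in range(20)]
subs = sub_flows(flow, offsets=[0, 5, 10])
training_set = [s for sf in subs for s in (sf, reverse_pair(sf))]
assert len(training_set) == 6            # each sub-flow contributes a pair
assert reverse_pair(reverse_pair(subs[0])) == subs[0]
```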
A limited number of representative sub-flows that best capture distinctive statistical varia-
tions of the full-flows are selected to train the classifier. SSP-ACT makes use of unsupervised
clustering ML techniques to automate the selection process.
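The selection step can be illustrated with a compact k-means clustering sketch. This is my own stand-in for the unsupervised clustering techniques referred to above (the seeding strategy, feature vectors and parameter values are illustrative assumptions): group the sub-flow feature vectors, then keep the member nearest each cluster centre as that cluster's representative training sub-flow.

```python
# Sketch: k-means over sub-flow feature vectors, then pick one
# representative sub-flow per cluster for the training set.
def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    centroids = points[:k]                      # simple deterministic seeding
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[min(range(k), key=lambda j: dist2(p, centroids[j]))].append(p)
        centroids = [tuple(sum(c) / len(g) for c in zip(*g)) if g else centroids[j]
                     for j, g in enumerate(groups)]
    return centroids

def representatives(points, centroids):
    """Pick, per cluster, the actual sub-flow vector closest to the centroid."""
    return [min(points, key=lambda p: dist2(p, c)) for c in centroids]

# Toy feature vectors (mean packet length, mean inter-arrival time) from two
# distinct statistical behaviours within one application's flows.
points = [(60.0, 0.05), (62.0, 0.06), (61.0, 0.05),
          (400.0, 0.01), (410.0, 0.01), (405.0, 0.02)]
cents = kmeans(points, k=2)
reps = representatives(points, cents)
assert len(reps) == 2 and reps[0] != reps[1]
```

The point of the representative step is that the classifier is trained on a small number of real sub-flows that between them span the full-flows' distinctive statistical variations, rather than on every sub-flow.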
I demonstrate the effectiveness of SSP-ACT by constructing an ML classifier designed to
identify highly interactive online game traffic mixed with thousands of unrelated interfering
traffic flows. I chose a popular First Person Shooter (FPS) game application (Wolfenstein En-
emy Territory (ET) [47]), the traffic characteristics of which can change significantly over the
lifetime of each flow, and are asymmetric in the client-to-server and server-to-client directions.
I evaluate the generality of SSP-ACT with the classification of another Internet application,
Voice over IP (VoIP) traffic. The characteristics of VoIP traffic differ markedly from ET traffic, being more stable over a flow's lifetime and more symmetric in the forward and backward directions. I also perform a preliminary investigation of the impact of 5% random, independent packet loss on the classification of VoIP and ET traffic. The scalability of SSP-ACT for
concurrent classification of multiple applications is also discussed.
I demonstrate that SSP-ACT can significantly improve a classifier’s performance using a
small sliding window, regardless of how many packets are missed from the beginning of each
flow and of the direction of the first packet of the most recent N packets used for the classification. The classifiers trained using SSP-ACT maintain their accuracy well in the presence of 5% random, independent synthetic packet loss. I also demonstrate that SSP-ACT is effective in
identifying both ET and VoIP traffic concurrently, by using a single common classifier or two
separate classifiers in parallel, one for each application.
At the time of submitting this thesis, SSP-ACT has been implemented and used in an auto-
mated QoS-control system at Swinburne University of Technology [35], and has been demon-
strated to provide sub-second real-time classification of online game traffic.
My results show that SSP-ACT is a significant improvement over the previously published state of the art for IP traffic classification. Although the experiments are confined to online
game and VoIP applications, my results reveal a potential solution to the accurate and timely
classification of traffic belonging to other Internet applications.
The thesis is organised as follows.
In Chapter 2 I provide the context for IPTC in IP networks, and highlight its importance
in the areas of QoS provisioning, Internet accounting and charging, and lawful interception. I
then review the traditional methods of traffic classification, and highlight the motivations for
emerging ML-based IPTC techniques.
ML-based IPTC is an interdisciplinary field, involving both networking and data mining techniques. It leverages data mining to explore the large space of traffic statistical properties and to devise novel classification rules emerging from the mining process. In Chapter 3, I
summarise the basic concepts of ML and how they can be applied to IPTC. I discuss a number
of key requirements for the employment of ML-based classifiers in operational IP networks,
which act as guidelines for my research presented in subsequent chapters.
In Chapter 4 I review significant prior work related to ML-based IPTC. I discuss its limitations with regard to the operational challenges addressed in Chapter 3. This helps
me define my research question with a justification of its originality and novelty and the reasons
why it is worth pursuing. The chapter is concluded with the problem statement for my thesis.
In Chapter 5 I present my novel modification to traditional ML training and classification
techniques, using a multiple sub-flows training method. I demonstrate that the method optimises
the classification of flows within finite periods of time, regardless of where within the flows’
lifetime the traffic is captured. My experiments are conducted on the Naive Bayes and C4.5
Decision Tree classifiers, with the goal of classifying ET traffic against a number of other common
Internet applications.
In Chapter 6 I propose and demonstrate an automated approach based on the use of clus-
tering ML techniques to choose appropriate, representative sub-flows, from which effective
ML-based IP traffic classifiers may be trained.
In Chapter 7, I demonstrate the directionality issues that arise when a classifier is trained on an assumed flow direction, an assumption that may be wrong when classifying in real operational networks. I propose and demonstrate that training on synthetic sub-flow pairs allows the classifier to maintain its performance without relying on prior knowledge of the inferred or actual directionality of a flow.
Chapter 8 provides an evaluation of the overall SSP-ACT approach. Its effectiveness is demonstrated with a VoIP application. My preliminary investigation of the impact of 5% random, independent packet loss on the classification of VoIP and ET traffic is
presented. I also propose two different implementation options for the concurrent classification
of multiple applications, the pros and cons of which are discussed.
Chapter 9 concludes the thesis with final remarks and suggestions for future work.
Chapter 2
Application Context for ML Based IPTraffic Classification
2.1 Introduction
Real-time IP traffic classification (IPTC) has the potential to solve difficult network manage-
ment problems for Internet service providers (ISPs) and their equipment vendors. Network
operators need to know what is flowing over their networks promptly so they can react quickly
in support of their various business goals. Traffic classification may be a core part of automated
intrusion detection systems [48][42][49], used to detect patterns indicative of denial of service
attacks, to trigger automated re-allocation of network resources for priority customers [33], or to
identify customer use of network resources for accounting and billing purposes. More recently,
governments have also been clarifying ISP obligations with respect to ‘lawful interception’ (LI)
of IP data traffic [50]. Just as telephone companies must support interception of telephone
usage, ISPs are increasingly subject to government requests for information on network use
by particular individuals at particular points in time. IPTC is an integral part of ISP-based LI
solutions.
Commonly deployed IPTC techniques have been based around direct inspection of each
packet’s contents at some point on the network. Successive IP packets that have the same
five-tuple of protocol type, source address:port and destination address:port are considered to
belong to a flow whose controlling application we wish to determine. Simple classification
infers the controlling application’s identity by assuming that most applications consistently
use ‘well known’ TCP or UDP port numbers (visible in the TCP or UDP headers). However,
many applications are increasingly using unpredictable (or at least obscure) port numbers [51].
Consequently, more sophisticated classification techniques infer application types by looking
for application-specific data (or well-known protocol behaviour) within the TCP or UDP pay-
loads [52].
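The classical scheme described above can be sketched briefly. This is a minimal illustration of my own (not any deployed classifier's code): packets sharing the same five-tuple, in either direction, are grouped into one flow, which is then labelled by looking up 'well known' ports. The port numbers shown are a small illustrative subset of IANA registrations.

```python
# Sketch: bi-directional five-tuple flow keying and port-based classification.
WELL_KNOWN_PORTS = {80: "http", 443: "https", 25: "smtp", 53: "dns"}

def flow_key(proto, src_ip, src_port, dst_ip, dst_port):
    """Canonical bi-directional five-tuple: both directions map to one key."""
    a = (src_ip, src_port)
    b = (dst_ip, dst_port)
    return (proto,) + (a + b if a <= b else b + a)

def classify_by_port(src_port, dst_port):
    """Guess the controlling application from either end's port number."""
    return (WELL_KNOWN_PORTS.get(dst_port)
            or WELL_KNOWN_PORTS.get(src_port)
            or "unknown")

# Forward and reverse packets of the same conversation share one flow key.
fwd = flow_key("tcp", "10.0.0.1", 49152, "192.0.2.7", 80)
rev = flow_key("tcp", "192.0.2.7", 80, "10.0.0.1", 49152)
assert fwd == rev
assert classify_by_port(49152, 80) == "http"
assert classify_by_port(6881, 50000) == "unknown"  # unregistered ports defeat this scheme
```

The final assertion illustrates the weakness discussed in the text: an application on unregistered or obscure ports falls straight through to "unknown".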
Unfortunately, the effectiveness of such ‘deep packet inspection’ techniques is diminishing.
Such packet inspection relies on two related assumptions:
• Third parties unaffiliated with either source or recipient are able to inspect each IP packet’s
payload (i.e. the payload is visible).
• The classifier knows the syntax of each application’s packet payloads (i.e. the payload
can be interpreted).
Two emerging challenges undermine the first assumption – customers may use encryption to
obfuscate packet contents (including TCP or UDP port numbers), and governments may impose
privacy regulations constraining the ability of third parties to lawfully inspect payloads at all.
The second assumption imposes a heavy operational load – commercial devices would need
repeated updates to stay ahead of regular (or simply gratuitous) changes in every application’s
packet payload formats.
The research community has responded by investigating classification schemes capable of
inferring application-level usage patterns without deep inspection of packet payloads. Newer
approaches (e.g. [53], [54], [55], [56], [57] and [58]) classify traffic by recognising statistical
patterns in externally observable attributes of the traffic (such as typical packet lengths, inter-
packet arrival times, and flow duration and volume). The goal is to either cluster IP traffic flows
into groups that have similar traffic patterns, or classify one or more applications of interest.
A number of researchers are looking at the application of Machine Learning (ML) tech-
niques (a subset of the wider Artificial Intelligence discipline) to IPTC (e.g. [59], [60], [61]).
The application of ML techniques involves a number of steps. First, features are defined by
which future unknown IP traffic may be identified and differentiated. Features are attributes of
flows calculated over multiple packets (such as maximum or minimum packet lengths in each
direction, flow durations or inter-packet arrival times). The ML classifier is trained to associate
sets of features with known traffic classes (creating rules), and to apply the ML algorithm to
classify unknown traffic using the previously learned rules. Every ML algorithm has a different
approach to sorting and prioritising sets of features, which leads to different dynamic behaviours
during training and classification.
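The two ML steps described above can be sketched as follows. This is a small illustration of my own (toy data, and a nearest-centroid rule standing in for a real ML algorithm such as Naive Bayes or C4.5): compute per-flow features, then train a model that maps feature vectors to known traffic classes and applies the learned rules to unseen flows.

```python
# Sketch: per-flow feature computation, training, and classification.
def features(pkts):
    """pkts: list of (timestamp_s, length_bytes). Returns a feature vector."""
    lengths = [l for _, l in pkts]
    gaps = [t2 - t1 for (t1, _), (t2, _) in zip(pkts, pkts[1:])]
    return (min(lengths), max(lengths),
            sum(lengths) / len(lengths),
            sum(gaps) / len(gaps) if gaps else 0.0)

def train(labelled_flows):
    """Learn one centroid (mean feature vector) per known traffic class."""
    centroids = {}
    for label, flows in labelled_flows.items():
        vecs = [features(f) for f in flows]
        centroids[label] = tuple(sum(col) / len(col) for col in zip(*vecs))
    return centroids

def classify(centroids, flow):
    """Assign the class whose centroid is nearest (squared Euclidean distance)."""
    v = features(flow)
    dist = lambda c: sum((a - b) ** 2 for a, b in zip(v, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

# Toy training data: small, regularly spaced packets for 'game'; large for 'bulk'.
game = [[(i * 0.05, 60 + i % 5) for i in range(20)] for _ in range(3)]
bulk = [[(i * 0.01, 1400) for i in range(20)] for _ in range(3)]
model = train({"game": game, "bulk": bulk})
assert classify(model, [(i * 0.05, 62) for i in range(20)]) == "game"
assert classify(model, [(i * 0.01, 1380) for i in range(20)]) == "bulk"
```

A real ML algorithm would of course sort and weight the features far more cleverly than this centroid rule, which is exactly the point made above about different algorithms exhibiting different training and classification behaviour.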
This chapter provides the rationale for IPTC in IP networks, reviews the traditional ap-
proaches to traffic classification, and highlights the motivations for emerging ML-based tech-
niques for IPTC.
The rest of this chapter is organised as follows. Section 2.2 justifies the importance of IPTC by reviewing the important networking areas of QoS issues and provisioning, Internet pricing and lawful interception. Section 2.3 introduces a number of metrics for
assessing classification accuracy. Section 2.4 discusses the limitations of traditional port- and
payload-based classification techniques. This provides the basis for the motivation for statistical
and ML based traffic classification approaches discussed in section 2.5. Section 2.6 concludes
the chapter with some final remarks.
2.2 The importance of IP traffic classification
The importance of IPTC may be illustrated by reviewing the important areas of IP QoS issues
and provisioning, Internet pricing and Lawful Interception (LI).
2.2.1 QoS issues over Last Mile networks
Network capacity tends to be high in core (backbone) networks, low in access networks and
high in home or enterprise LANs. Consequently the edge (the boundary between ISP and
customer networks) tends to make a significant contribution to observed network queuing delay
and jitter [62]. I conducted a number of experimental studies to observe the degree to which
modern, ‘high bandwidth’ access technologies still introduce uncontrolled latency fluctuations
[63], [64] and [65]. I focused in particular on two common Internet access technologies: Data
Over Cable Service Interface Specifications (DOCSIS) [66] networks and 802.11b [67] wireless
local area networks.
A typical DOCSIS access network is illustrated in Figure 2.1. In this scenario, the home
user’s equipment (used for various activities, such as Web browsing, data and movie down-
loading, or playing interactive online games and chat) is connected to the remote content or
game servers through the DOCSIS cable network of an ISP. Conceptually, the user’s traffic
travels through the user’s Cable Modem (CM), the Hybrid Fibre Coaxial Network (HFC) and
the Cable Modem Termination System (CMTS) at the ISP site, and the remote links.
[Figure 2.1: A typical DOCSIS cable network from ISP to home users. The diagram shows home devices (VoIP, online game, and Web/P2P/SSH/SMTP traffic) connected through a Cable Modem and the HFC network to the CMTS at the ISP, and onward to remote game and application servers; DS and US denote the downstream and upstream directions.]
I observed that when a client downloads content from an ISP-hosted server the DOCSIS
link exhibits a significant spike in latency that impacts on all traffic concurrently sharing the
DOCSIS link. (In my particular experiments [64], [63] the RTT jumped from 13ms when idle
to over 100ms during long-lived TCP-based data transfers from a remote server to a home-based
client¹.)
Wireless LAN networks have become popular for interactive applications such as online
gaming and videoconferencing. As with DOCSIS, I observed that consumer-grade 802.11b
networks exhibited latency fluctuations in excess of 100ms during long-lived TCP-based data
transfers [64] and [65].
These experiments confirmed my belief that modern access link technologies must deploy
traffic prioritisation mechanisms to effectively isolate different classes of end-user traffic from
each other. (With respect to my specific examples, better QoS control requires a CMTS, CM,
802.11 AP and/or 802.11 client that can discriminate between Internet applications, classes of
traffic and customers with different needs.)
2.2.2 QoS provisioning
In responding to the problem of network congestion, a common strategy for network providers is to under-utilise (over-provision) the link capacity. However, this is not necessarily an economic solution for most ISPs. The Internet QoS solutions proposed over the last decade can be classified into three broad categories: Internet QoS standards, industry QoS-enabled products, and others.

¹ The downstream (DS) and upstream (US) directions were capped to 2Mbps and 1Mbps respectively. This approximated a consumer-grade cable-modem downlink while also ensuring the upstream ACK rate was not a limiting factor. Further characterisation of the increase in RTT as a function of offered load is presented in [64], [63].
A common requirement for these frameworks is an effective IPTC mechanism. An overview
of these frameworks provides the context for the use of IPTC in IP networks.
Internet QoS standards
The Integrated Services (IntServ) architecture [30] was the first major attempt to enhance the
Internet with QoS capabilities. It developed a new architecture for resource allocation, to meet
the requirements of real-time applications while preserving the datagram model of IP-based
networks. The basis of this approach is per-flow resource reservation. The resource reservation
protocol (RSVP) [68] has been developed as an end-to-end resource reservation set-up protocol
that maintains the reservation state inside the network [69].
The differentiated services (DiffServ) [31] architecture, unlike IntServ, does not provide a
complete solution for end-to-end QoS set-up or management. DiffServ defines only a set of
per-hop building blocks and a language in which to express per-hop forwarding behaviours.
Both IntServ and DiffServ rely on packet header inspection to map traffic to reserved re-
sources or forwarding behaviours (respectively) in each router along a path.
QoS-enabled solutions from industry
Since the early 2000s the telecommunications industry has introduced a number of QoS-enabled
products which can provide some QoS guarantees. For example, Ubicom Inc.’s ‘StreamEngine’
technology [70], and D-Link’s ‘GameFuel’ products [71] built upon that StreamEngine technol-
ogy, offer routers targeted specifically at providing QoS for multiplayer games and real-time,
interactive traffic applications. StreamEngine technology relies on local packet inspection to
classify packets into QoS classes for QoS provisioning and management.
Another example is Cisco Systems’ integration of AutoQoS features (e.g. for voice traffic
[72] ) into their high-end switches and routers. The technology combines traffic classification
with configuration of Differentiated Services across the network. Packets are classified based
on policies specified by the network operator, which are mostly based on the physical port,
source or destination IP or MAC address, IP protocol type, or payload content. Marked packets
or flows are then tagged with a specific priority for treatment when they arrive at the Cisco
QoS-enabled device [72].
Allot Communications Ltd has developed a range of products called NetEnforcer to provide
QoS control and service level management in IP networks. The Allot NetEnforcer technique
relies on deep content packet inspection for traffic classification and control [73]. Priority queu-
ing is used to provide QoS. With NetEnforcer, each new connection flow gets its own queue
(per-flow queuing). The new queue is treated equally with other flows having the same priority
policy class.
In 2008, Exinda Networks [74] introduced a range of products called Application Acceleration, and NetIntact [75] introduced its PacketLogic Generation 2 products. These products
also rely on packet content inspection for traffic identification, while rate shaping and priority
queuing are used to provide QoS.
Automated QoS solutions using traffic classification and priority systems
Distributed mechanisms for classifying and controlling traffic over access links are also being
explored. One example is ANGEL (Automated Network Games Enhancement Layer [35] [76],
itself an evolution of [33]). ANGEL provides for remote control of traffic differentiation in
customer modems or routers based on traffic classification occurring inside the ISP network.
The architecture of ANGEL is comprised of both ISP-side and CPE-side components. The
ISP-side components of ANGEL receive a copy of network traffic that is later classified, so as
to detect network game traffic. Once a game flow is detected², ANGEL informs individual
CPE devices of this identification (using the ANGEL ISP/CPE protocol) to allow flow prioriti-
sation at the CPE. ANGEL has been implemented using machine learning techniques for traffic
classification, building on my work presented in this thesis.
The role of IP traffic classification
All QoS schemes have some degree of IPTC implicit in their design. DiffServ assumes that edge routers can recognise and differentiate between aggregate classes of traffic in order to set the DiffServ code point (DSCP) on packets entering the network core. IntServ presumes that routers along a path are able to differentiate between finely grained traffic classes (and historically has presumed the use of packet header inspection to achieve this goal).

² Although the goal of ANGEL is to provide QoS for game traffic, its architecture can be used for other real-time traffic as well.
Furthermore, real-time traffic classification is the core component of recent QoS-enabled
products [70] and automated QoS architectures [33] [35]. For example, with the StreamEngine
technology, one of the most important steps is to automatically classify the traffic passing
through the system to assign appropriate levels of priority [77]. The deployability of the ar-
chitecture in [33], [35] depends on the choice of the core components, including the traffic
classifier.
2.2.3 Internet pricing
The development of QoS solutions such as IntServ or DiffServ has been stymied in part due to
the lack of an effective service pricing mechanism (as suggested in [69] and [78]). A pricing
mechanism is needed to differentiate customers with different needs and charge for the QoS that
they receive. It would also act as a cost recovery mechanism and provide revenue generation
for the ISPs to compensate for their efforts in providing QoS and managing resource allocation.
Traffic classification has great potential to support a practical class-based Internet QoS charging
mechanism.
Finding a fairer and more efficient charging scheme for the Internet has attracted a signifi-
cant amount of research over the past decade. Work has included proposals for a smart market
[79], shadow pricing [80], rate-based pricing [81], edge pricing [82], congestion discount [83],
zone-based cost sharing [84], Paris metro pricing [85], Tirupati pricing [86], priority pricing
[87] [88] [89], pricing for Integrated Services [38], Differentiated Services [90], pricing for re-
source negotiation [91] and pricing over congestion control [92] [93]. (While not central to this
thesis, I present a detailed review of these ideas in [37] and [36].)
Many pricing models have been proposed, aimed at an ideal pricing scheme which is able
to:
• Provide levels of services suited to different users with different needs.
• Charge users only for their perceived quality of service (QoS) and the resources they
consume.
• Cope with the non-uniformity of Internet traffic with different QoS requirements.
• Enable ISPs to develop sustainable and profitable business models.
Most proposed Internet pricing models achieve a subset of the goals listed above. However, no particular solution has been widely implemented: the Internet is still dominated by flat-rate pricing and simple usage charges (such as charging per volume of traffic or
connection duration). Exploring the issues associated with Internet pricing reveals that a prac-
tical solution needs to consider three important metrics, namely technical efficiency, economic
efficiency, and social impact [36] [37].
Technical efficiency refers to the costs associated with applying the new technology of a
particular pricing model or QoS provisioning scheme. Economic efficiency captures the impact
of a pricing scheme on network utilisation and the optimisation of a service provider’s revenue.
This dimension reflects the capability to accommodate new Internet services and valued customers, and the maximisation of profit gained by charging for customers' traffic and the QoS delivered.
Social impact concerns fairness for network users.
For most pricing schemes there is a distinct coupling and interrelationship between eco-
nomic efficiency, social impact and technical efficiency. Clearly, it is desirable to discover an
optimal pricing model in which economic efficiency, social impact and technical efficiency are
all concurrently maximised. However, in reality pricing models always tend to reveal a trade-off
between these three dimensions.
Most QoS provisioning schemes reviewed in section 2.2.2 try to provide QoS to end users
by differentiating among users on the basis of different needs/preferences or their different
types of applications. Different QoS treatment on the network is then provided accordingly. A
compatible Internet pricing model, therefore, needs to rely on an accurate classification of users' traffic, and to charge users for the QoS delivered.
Furthermore, as indicated in [36],[38],[39],[40], and [94], from the user’s perspective, the
most important requirements and expectations are the transparency, stability and predictability
of a pricing scheme, as well as the QoS provisioning mechanism. Probably most Internet users
have little or no interest in the underlying technologies or complicated ways by which their
applications or network are managed. Techniques which require user intervention and special
knowledge about the underlying technology are likely to be a hindrance to deployment. Users
should not have to signal their application’s identity to the underlying network through explicit
QoS preferences. Such tasks should be performed automatically by the network providers.
From an ISP perspective, implementation costs are critical and must not exceed the revenues
likely to be gained by introducing any new scheme. Network stability and reliability must
also be considered. ISPs resist deploying a complex technology if there are questions as to its
reliability or the operational effort required.
Accurate, automated traffic classification is an important component of any practical and
deployable QoS-based pricing scheme.
2.2.4 Lawful interception
There is an emerging requirement for ISP networks to provide Lawful Interception (LI) capabil-
ities, and traffic classification is an important solution in this regard [50] [95] [43]. Governments
typically implement LI at various levels of abstraction. In the telephony world a law enforce-
ment agency may nominate a ‘person of interest’ and issue a warrant for the collection of in-
tercept information. The intercept may be high-level call records (who called whom and when)
or low-level ‘tapping’ of the audio from actual phone calls in progress. In the ISP space, traffic
classification techniques offer the possibility of identifying traffic patterns (which end points
are exchanging packets and when), and identifying what classes of applications are being used
by a ‘person of interest’ at any given point in time (e.g. [96], [46], [44] and [97]). Depending on
the particular traffic classification scheme, this information may potentially be obtained without
violating any privacy laws covering the TCP or UDP payloads of the ISP customer’s traffic [45].
2.3 Traffic classification metrics
A key criterion on which to differentiate between classification techniques is predictive accuracy
(i.e., how accurately the technique or model makes decisions when presented with previously
unseen data). A number of metrics exist with which to express predictive accuracy.
2.3.1 Positives, negatives, accuracy, precision and recall
Let us assume there is a traffic class X that we wish to identify, mixed with a broader set of IP
traffic. A traffic classifier is used to identify (classify) packets (or flows of packets) belonging to
class X when presented with a mixture of previously unseen traffic. The classifier is presumed
to give one of two outputs: a flow (or packet) is either believed to be a member of class X, or it
is not.
A common way to characterise a classifier’s accuracy is through metrics known as the per-
centage of False Positives, False Negatives, True Positives and True Negatives. These metrics
are defined as follows:
• False Negatives (FN): The number of members of class X incorrectly classified as not
belonging to class X.
False Negatives Percentage (FN%): The percentage of FN, among all members of class
X.
• False Positives (FP): The number of members of other classes incorrectly classified as
belonging to class X.
False Positives Percentage (FP%): The percentage of FP, among all members of other
classes.
• True Positives (TP): The number of members of class X correctly classified as belonging
to class X.
True Positives Percentage (TP%): The percentage of TP among all members of class X
(equivalent to 100% - FN%).
• True Negatives (TN): The number of members of other classes correctly classified as not
belonging to class X.
True Negatives Percentage (TN%): The percentage of TN, among all members of other
classes (equivalent to 100% - FP%).
Figure 2.2 illustrates the relationships between FN, FP, TP and TN. A good traffic classifier
aims to minimise the False Negatives and False Positives.
[Figure 2.2: Evaluation Metrics. The diagram shows how classified instances partition into the
TP, FN, FP and TN regions.]
Some work in the literature makes use of Accuracy as an evaluation metric. It is generally
defined as the percentage of correctly classified instances among the total number of instances.
This definition is used throughout the thesis unless otherwise stated.
The ML literature often utilises two additional metrics known as Recall and Precision.
These metrics are defined as follows:
• Recall: Percentage of members of class X correctly classified as belonging to class X,
among all members of class X.
• Precision: Percentage of those instances that truly belong to class X, among all those
classified as class X.
The percentage metrics above range from 0% to 100%. For TP%, TN%, Recall and Precision,
100% is optimal, while for FN% and FP% the optimum is 0%. It can be seen that Recall is
equivalent to TP%.
With regard to Figure 2.2, Recall and Precision are defined as follows:

Recall = TP / (TP + FN)        Precision = TP / (TP + FP)
Though all these metrics can be appropriate to evaluate a classifier, it is important to realise
that the overall accuracy metric might not be the best metric (sometimes even misleading) to re-
flect the classifier’s performance. This can be the case when there is a great imbalance between
traffic classes’ population sizes, for example, if class X contains only a single member, while
non-class X (X̄) contains 99 members. If the classifier misclassifies class X's single member
as X̄, and correctly classifies all 99 X̄ members as X̄, then the overall accuracy is very high (99%) while
actually the FN% is 100% (or Recall is 0%). If we want to identify members of class X, the
accuracy per class X (e.g. FN% and FP% or Recall and Precision) is more important than the
overall accuracy of the classifier.
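The imbalance example above can be checked with a short sketch (the counts are the hypothetical 1-versus-99 scenario, not measured data):

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, recall, precision) as fractions; precision is
    None when nothing was classified as X (TP + FP == 0)."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    recall = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None
    return accuracy, recall, precision

# Class X has 1 member, non-X has 99, and the classifier labels everything as non-X.
acc, rec, prec = metrics(tp=0, fp=0, tn=99, fn=1)
print(acc, rec, prec)  # 0.99 0.0 None
```

Overall accuracy is 99% even though every member of class X was missed (Recall is 0), which is exactly the misleading case discussed above.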
In this thesis I focus on the Recall and Precision metrics commonly used in the ML literature,
as they summarise well the performance of the classifier per class. It is also important to note
that high Precision is only meaningful when the classifier has achieved good Recall and vice
versa.
2.3.2 Byte and flow accuracy
When comparing the literature on different classification techniques it is also important to note
the unit of the author’s chosen metric. Recall, Precision, FN and FP may all be reported as
percentages of bytes or flows relative to the traffic being classified. An author’s choice here can
significantly alter the meaning of the reported accuracy results.
Most recently published traffic classification studies have focused on flow accuracy – mea-
suring the accuracy with which flows are correctly classified, relative to the number of other
flows in the author’s test and/or training dataset(s). However, some recent work has also chosen
to express accuracy calculations in terms of byte accuracy – focusing more on how many bytes
are carried by the packets of correctly classified flows, relative to the total number of bytes in
the author’s test and/or training dataset(s) (e.g. [53] and [98]).
Erman et al. in [99] argue that byte accuracy is crucial when evaluating the accuracy of
traffic classification algorithms. They note that the majority of flows on the Internet are small
and account for only a small portion of total bytes and packets in the network (mice flows).
On the other hand, the majority of traffic bytes are generated by a small number of large flows
(elephant flows). They provide an example from a six-month data trace which found the top
(largest) 1% of flows accounted for over 73% of the traffic in terms of bytes. With a threshold to
differentiate elephant and mice flows of 3.7MB, the top 0.1% of flows would account for 46%
of the traffic (in bytes). Presented with such a dataset, a classifier optimised to identify all but
the top 0.1% of the flows could attain a 99.9% flow accuracy but still result in 46% of the bytes
in the dataset being misclassified.
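The elephant/mice effect can be reproduced with a toy flow table (the flow counts and sizes below are invented, chosen only to echo the shape of the example in [99]):

```python
# 999 small ("mice") flows, all classified correctly, plus one large
# ("elephant") flow that is misclassified. Each entry is
# (bytes_carried, correctly_classified).
flows = [(10_000, True)] * 999 + [(8_500_000, False)]

flow_acc = sum(ok for _, ok in flows) / len(flows)
byte_acc = sum(b for b, ok in flows if ok) / sum(b for b, _ in flows)

print(f"flow accuracy = {flow_acc:.1%}, byte accuracy = {byte_acc:.1%}")
```

Flow accuracy is 99.9%, yet nearly half the bytes are misclassified, illustrating why the choice of unit matters.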
Whether flow accuracy or byte accuracy is more important will generally depend on the
classifier’s intended use. For example, when classifying traffic for IP QoS purposes it is plau-
sible that identifying every instance of a short-lived flow needing QoS (such as five-minute,
32Kbit/sec phone calls) is as important as identifying long-lived flows needing QoS (such as
a 30 minute, 256Kbit/sec video conference), with both being far more important to correctly
identify than the few flows that represent multi-hour (and/or hundreds of megabytes) peer-to-
peer file sharing sessions. Conversely, an ISP undertaking analysis of load patterns on their
network may well be significantly interested in correctly classifying the applications driving the
elephant flows that contribute a disproportionate number of packets across their network.
In this thesis, I focus on IPTC to support QoS solutions. Hence, I use flow accuracy to
evaluate the performance of a classifier under test.
2.4 Limitations of packet inspection for traffic classification
Traditional IPTC relies on the inspection of a packet’s TCP or UDP port numbers (port-based
classification), or the reconstruction of protocol signatures in its payload (payload-based clas-
sification). Each approach suffers from a number of limitations.
2.4.1 Port-based IP traffic classification
TCP and UDP provide for the multiplexing of multiple flows between common IP end points
through the use of port numbers. Historically many applications utilise a ‘well-known’ port
on their local host as a rendezvous point to which other hosts may initiate communication. A
classifier sitting in the middle of a network need only look for TCP SYN packets (the first step
in a TCP’s three-way handshake during session establishment) to know the server side of a new
client-server TCP connection. The application is then inferred by looking up the TCP SYN
packet’s target port number in the Internet Assigned Numbers Authority’s (IANA) list of reg-
istered ports [100]. UDP uses ports in a similar way, though without connection establishment
or the maintenance of connection state.
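The port-lookup step can be sketched as follows. The dictionary is a tiny hand-picked excerpt; a real classifier would consult the full IANA registry:

```python
# Excerpt of well-known port assignments (illustrative subset only).
IANA_PORTS = {20: "ftp-data", 21: "ftp", 22: "ssh", 25: "smtp",
              53: "dns", 80: "http", 110: "pop3", 443: "https"}

def classify_syn(dst_port: int) -> str:
    """Infer the application from the destination (server) port of a TCP SYN."""
    return IANA_PORTS.get(dst_port, "unknown")

print(classify_syn(80))    # http
print(classify_syn(6699))  # unknown (e.g. an unregistered P2P port)
```

The second call shows the fundamental weakness: any application on an unregistered or non-standard port falls straight through the lookup.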
However, this approach has limitations. Firstly, some applications may not have their ports
registered with IANA (for example, peer-to-peer applications such as Napster and Kazaa) [61].
An application may use ports other than its well-known ports to avoid operating system access
control restrictions (for example, non-privileged users on Unix-like systems may be forced to
run HTTP servers on ports other than port 80). Also, in some cases server ports are dynamically
allocated as needed. For example, the RealVideo streamer allows the dynamic negotiation of
the server port to be used for the data transfer. This server port is negotiated on an initial TCP
connection, which is established using the well-known RealVideo control port [101].
Moore and Papagiannaki [102] observed no better than a 70% byte accuracy for port-based
classification using the official IANA list. Madhukar and Williamson [103] showed that port-
based analysis was unable to identify 30-70% of the Internet traffic flows they investigated. Sen
et al. [52] reported that the default port accounted for only 30% of the total traffic (in bytes) for
the Kazaa P2P protocol.
In some circumstances IP layer encryption may also obfuscate the TCP or UDP header,
making it impossible to know the actual port numbers.
2.4.2 Payload-based IP traffic classification
To avoid total reliance on the semantics of port numbers, many current industry products utilise
stateful reconstruction of session and application information from each packet’s content.
Sen et al. [52] demonstrated that payload-based classification of P2P traffic (by examin-
ing the signatures of the traffic at the application level) could reduce false positives and false
negatives to 5% of total bytes for most P2P protocols studied.
Moore and Papagiannaki [102] use a combination of port- and payload-based techniques
to identify network applications. The classification procedure starts with the examination of
a flow’s port number. If no well-known port is used, the flow is passed through to the next
stage. In the second stage, the first packet is examined to see whether it contains a known
signature. If one is not found, then the packet is examined to see whether it contains a well-
known protocol. If these tests fail, the protocol signatures in the first KByte of the flow are
studied. Flows that remain unclassified after that stage require inspection of the entire flow
payload. Their results show that port information by itself is capable of correctly classifying
69% of the total bytes. Furthermore, including the information observed in the first KByte of
each flow increases the accuracy to almost 79%. Higher accuracy (up to nearly 100%) can only
be achieved by investigating the remaining unclassified flows’ entire payload.
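The staged procedure can be caricatured in a few lines. The signature strings here are invented placeholders, not the actual signatures used in [102]:

```python
# Stage 1: well-known port; Stage 2: signature in the first packet's
# payload; anything else would require deeper payload inspection.
WELL_KNOWN = {80: "http", 25: "smtp"}
SIGNATURES = {b"GET ": "http", b"HELO": "smtp", b"\x13Bit": "bittorrent"}

def classify(dst_port: int, first_payload: bytes) -> str:
    if dst_port in WELL_KNOWN:                 # stage 1: port number
        return WELL_KNOWN[dst_port]
    for sig, app in SIGNATURES.items():        # stage 2: payload signature
        if first_payload.startswith(sig):
            return app
    return "unclassified"                      # later stages: full payload

print(classify(8080, b"GET /index.html"))  # http (caught by signature, not port)
```

Each stage trades extra inspection cost for extra coverage, mirroring the accuracy progression (69%, 79%, up to nearly 100%) reported above.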
Although payload-based inspection avoids reliance on fixed port numbers, it imposes sig-
nificant complexity and a substantial processing load on the traffic identification device. It must
be kept up-to-date with extensive knowledge of application protocol semantics, and must be
powerful enough to perform concurrent analysis of a potentially large number of flows. This
approach can be difficult or impossible when dealing with proprietary protocols or encrypted
traffic. Furthermore, direct analysis of session and application layer content may represent a
breach of organisational privacy policies or a violation of relevant privacy legislation.
2.5 Classification based on statistical traffic properties
The preceding techniques are limited by their dependence on the inferred semantics of the infor-
mation gathered through deep inspection of packet content (payload and port numbers). Newer
approaches rely on the traffic’s statistical characteristics to identify the application. An assump-
tion underlying such methods is that traffic at the network layer has statistical properties (such
as the distribution of flow duration, flow idle time, packet inter-arrival time and packet lengths)
that are unique to certain classes of applications, enabling different source applications to
be distinguished from each other.
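A sketch of how such statistical properties might be computed for a single flow from (timestamp, packet length) pairs; the four packets are invented for illustration:

```python
import statistics

# (timestamp_seconds, packet_length_bytes) for one flow's packets.
packets = [(0.000, 60), (0.021, 1500), (0.043, 1500), (0.090, 60)]

lengths = [plen for _, plen in packets]
iats = [t2 - t1 for (t1, _), (t2, _) in zip(packets, packets[1:])]  # inter-arrival times

features = {
    "mean_len": statistics.mean(lengths),
    "stdev_len": statistics.stdev(lengths),
    "mean_iat": statistics.mean(iats),
    "duration": packets[-1][0] - packets[0][0],
}
print(features)
```

Features like these are externally observable, requiring neither port semantics nor payload access.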
The relationship between the class of traffic and its observed statistical properties has been
noted in [104] where the authors analysed and constructed empirical models of connection
characteristics - such as bytes, duration and arrival periodicity - for a number of specific TCP
applications, and in [105] where the authors analysed Internet chat systems by focusing on the
characteristics of the traffic in terms of flow duration, packet inter-arrival time and packet size
and byte profile. Later work (for example [106], [107] and [108]) also observed distinctive
traffic characteristics, such as the distributions of packet lengths and packet inter-arrival times,
for a number of Internet applications. The results of these studies have stimulated new classifi-
cation techniques based on the statistical properties of traffic flow. The need to deal with traffic
patterns, large datasets and multi-dimensional spaces of flow and packet attributes is one of the
reasons for the introduction of ML techniques into this field.
2.6 Conclusion
In this chapter, I have discussed the negative impacts of the Last Mile bottleneck on QoS de-
livery to sensitive applications. The results showed that QoS problems in access links cannot
be solved simply by providing larger amounts of uncontrolled bandwidth. This suggests the
need for an effective QoS control and traffic prioritising system to overcome the problem. For
implementing a number of proposed QoS architectures, a real-time automated IPTC is a crucial
component.
IPTC also plays an important role in the areas of Internet pricing and lawful interception.
An automated, real-time and low-cost IP traffic classifier operating at the network layer may
be a useful tool for a simple class-based Internet pricing and billing system. This may offer a
solution that satisfies the deployment requirements of both ISPs and their customers. A robust
IPTC scheme also has great potential to provide a non-intrusive solution for ISPs to satisfy
government LI requirements.
I have also demonstrated that the commonly deployed approaches, such as port-based or
deep packet inspection techniques, have been diminishing in effectiveness. The new direction
of classifying traffic by learning and recognising statistical patterns in externally observable
attributes of the traffic (such as packet lengths and inter-packet arrival times) is therefore
emerging as a promising solution. This provides the application context for Machine Learning-
based IPTC techniques.
In the next chapter, a brief background on Machine Learning and its application in the IPTC
field will be presented. I will also address a number of requirements for a deployable machine
learning based IP traffic classifier in an operational network. This acts as a guideline to my
novel proposal for a practical, real-time, automated IP traffic classifier, presented in Chapter 5
and Chapter 7.
Chapter 3
A Brief Background on Machine Learning and its Application to IP Traffic Classification
Machine Learning (ML) has long been known as a powerful technique for data mining and
knowledge discovery, which searches for and describes useful structural patterns in data. ML
has a great range of applications, including in relation to search engines, medical diagnosis, text
and handwriting recognition, image screening, load forecasting, marketing and sales diagnosis
[109].
This chapter summarises the basic concepts of ML and outlines how ML can be applied to
IPTC. It also discusses a number of key requirements for the employment of ML-based classi-
fiers in operational IP networks, which act as guidelines for my research on a novel, practical,
real-time, automated and deployable ML-based IP traffic classifier.
3.1 A review of classification with Machine Learning
In 1992 Shi [110] noted, ‘One of the defining features of intelligence is the ability to learn...
Machine learning is the study of making machines acquire new knowledge, new skills, and
reorganise existing knowledge’. A learning machine has the ability to learn automatically from
experience and refine and improve its knowledge base. In 1983 Simon noted, ‘Learning denotes
changes in the system that are adaptive in the sense that they enable the system to do the same
task or tasks drawn from the same population more efficiently and more effectively the next
time’ [111]; and in 2000 Witten and Frank observed, ‘Things learn when they change their
behavior in a way that makes them perform better in the future’ [109].
As mentioned previously, ML has a wide range of applications. The use of ML techniques
by network traffic controllers was proposed in 1990, aiming to maximise call completion in
a circuit-switched telecommunications network [112]; this was one of the studies that marked
the point at which ML techniques expanded their application space into the telecommunications
networking field. In 1994 ML was first utilised for Internet flow classification in the context of
intrusion detection [41]. This was the starting point for much of the work on ML techniques in
Internet traffic classification that followed.
3.1.1 Input and output of an ML process
ML takes input in the form of a dataset of instances (also known as examples). An instance
refers to an individual, independent example of the dataset. Each instance is characterised by the
values of its features (also known as attributes or discriminators) that measure different aspects
of the instance. (In the networking field consecutive packets from the same flow might form an
instance, while the set of features might include median inter-packet arrival times or standard
deviation of packet lengths over a number of consecutive packets in a flow.) The dataset is
ultimately presented as a matrix of instances versus features [109]. An example dataset is
illustrated in Figure 3.1. Within a dataset, the same set of features must be used to describe
every instance, although the feature values themselves can vary from instance to instance.
                     Feature 1           Feature 2                ...  Feature K  Class
                     (e.g. mean packet   (e.g. mean packet
                     length)             inter-arrival time)
 Flow instance 1     A_1                 B_1                      ...             Game
 Flow instance 2     A_2                 B_2                      ...             Other
 ...                 ...                 ...                      ...             ...
 Flow instance N-1   A_{N-1}             B_{N-1}                  ...             Other
 Flow instance N     A_N                 B_N                      ...             Game

Figure 3.1: An example dataset as a matrix of instances versus features
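The layout of Figure 3.1 can be expressed directly in code: one row per flow instance, the same feature set for every row, plus a class column. The feature names and values here are invented placeholders:

```python
# Each instance is described by the same features plus a class label.
instances = [
    {"mean_len": 180.0, "mean_iat": 0.02, "class": "Game"},
    {"mean_len": 900.0, "mean_iat": 0.40, "class": "Other"},
]
FEATURES = ["mean_len", "mean_iat"]  # identical feature set for every instance

X = [[inst[f] for f in FEATURES] for inst in instances]  # instances-by-features matrix
y = [inst["class"] for inst in instances]                # class column

print(X, y)
```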
The output is the description of the knowledge that has been learnt. How the specific out-
come of the learning process is represented (the syntax and semantics) depends largely on the
particular ML approach being used.
3.1.2 Different types of learning
Witten and Frank [109] define four basic types of learning:
• Classification (or supervised learning)
• Clustering (or unsupervised learning)
• Association
• Numeric prediction
Classification learning involves a machine learning from a set of pre-classified (also called pre-
labelled) examples, from which it builds a set of classification rules (a model) to classify unseen
examples. Clustering is the grouping of instances that have similar characteristics into clusters,
without any prior guidance. In association learning, any rules that strongly relate different
features’ values are sought (not only those that relate features’ values and class). In numeric
prediction, the outcome to be predicted is not a discrete class but a numeric quantity.
Most ML techniques used for IPTC focus on the use of supervised and unsupervised learn-
ing.
3.1.3 Supervised learning
Supervised learning creates knowledge structures that support the task of classifying new in-
stances into pre-defined classes [113]. The learning machine is provided with a collection of
sample instances, pre-classified into classes. Output of the learning process is a classification
model that is constructed by examining and generalising from the provided instances.
In effect, supervised learning focuses on modelling the input/output relationships. Its goal
is to identify a mapping from input features to an output class. The knowledge learnt (such as
commonalities among members of the same class and differences between competing ones) can
be presented, for example, as a flowchart, a decision tree or classification rules, which can be
used later to classify a new unseen instance.
There are two major phases (steps) in supervised learning:
• Training: The learning phase that examines the data provided (called the training dataset)
and constructs (builds) a classification model.
• Testing (also known as classifying): The model built in the training phase is used to
classify new, previously unseen instances.
For example, let C be a discrete set {y_1, y_2, ..., y_M} consisting of all the pre-defined classes.
A number of instances are selected for each class y_j (1 ≤ j ≤ M) to train the classifier. Let TS
be a training dataset, that is, a set of input/output pairs,

TS = {<x_1, y_1>, <x_2, y_1>, ..., <x_{N-1}, y_M>, <x_N, y_M>}
where x_i is the vector of input feature values for the i-th instance and y_j is its output class
value. The goal of classification can be formulated as follows: from a training dataset TS, find
a function f(x) of the input features that best predicts the output class y for any new, unseen
value of x (for example, with minimum probability of error). The function f(x) is the core of
the classification model.
As an ML principle, training data should have the same characteristics as the data to be
classified. Also, the model created during training is improved if we simultaneously provide
examples of instances that belong to class(es) of interest and instances known to not be mem-
bers of the class(es) of interest. This allows the ML algorithm to compare and contrast, and
to generalise the classification rules in distinguishing between the instances belonging to the
class(es) of interest and the others. This will enhance the model’s performance later in the
classification of new, previously unknown instances [110].
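The two phases can be sketched with a deliberately simple stand-in model (a nearest-centroid rule, chosen for brevity; it is not one of the algorithms used in this thesis): training builds f(x) from the labelled pairs, and testing applies it to unseen instances.

```python
def train(ts):
    """Training phase. ts: list of (feature_vector, class_label) pairs."""
    groups = {}
    for x, y in ts:
        groups.setdefault(y, []).append(x)
    # One centroid (per-feature mean) per class.
    centroids = {y: [sum(col) / len(xs) for col in zip(*xs)]
                 for y, xs in groups.items()}

    def f(x):  # the learnt classification model
        dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
        return min(centroids, key=lambda y: dist(x, centroids[y]))
    return f

TS = [([100, 0.01], "game"), ([120, 0.02], "game"),
      ([1400, 0.5], "bulk"), ([1300, 0.6], "bulk")]
f = train(TS)           # training phase: build the model
print(f([110, 0.015]))  # testing phase: classify an unseen instance -> game
```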
There exists a number of supervised learning classification algorithms, each differing mainly
in terms of the way the classification model is constructed and the optimisation algorithm used
to search for a good model. In this thesis I make use of two different learning algorithms: the
C4.5 Decision Tree [114] and supervised Naive Bayes [115] algorithms. Brief descriptions of
these two algorithms are presented in the sub-sections below.
The Naive Bayes algorithm
The Naive Bayes algorithm provides a simple approach to classification based on probabilistic
knowledge [115]. The method is designed for use in supervised classification, in which the goal
is to predict accurately the class of unseen data using the classification model built on training
instances.
Let C be the random variable denoting the class of an instance and let X = {X1,X2, ...Xn} be
a vector of random variables denoting the observed attribute values. Let c be a particular class
and x be an instance to be classified. The algorithm makes a statistical conclusion about the
probability of instance x belonging to a class c, based on:
• the probability of observing an instance of each class in the training set (the prior
probability, denoted by P(C = c))
• and the probability of the instance x given class c.
The calculation follows Bayes’s rule:
P(C = c | X = x) = P(C = c) P(X = x | C = c) / P(X = x)     (3.1)
Based on the outcome the classifier then predicts the most probable class. The Naive Bayes
algorithm relies on two assumptions: Firstly, it assumes that the instance’s attributes are inde-
pendent given the class, and that no hidden or latent attributes influence the prediction process
[115]. The second assumption is that within each class, the values of numeric attributes are
Normally (or Gaussian) distributed, so that the attribute’s value distribution can be represented
in terms of its mean and standard deviation, and the probability of an observed value can be
easily computed from such estimates.
In equation 3.1, X = x represents the event that X_1 = x_1 ∧ X_2 = x_2 ∧ ... ∧ X_n = x_n. With
the assumption that these attribute values are independent, one obtains:

P(X = x | C = c) = ∏_i P(X_i = x_i | C = c)     (3.2)
Generally the denominator of equation 3.1 is not directly estimated as it can be simply
considered to be a normalising factor [115].
Both independence and normality assumptions are violated in many cases 1. However, this
approach has been shown to work better than more complex methods, and it can also cope with
complex situations [109].
1 The normality assumption is violated in my ET datasets used in Chapter 5 as well. However, my experiments show that the Naive Bayes classifier trained using SSP-ACT still performs well in identifying ET traffic.
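Equations 3.1 and 3.2 translate almost directly into code. The sketch below trains on invented one-attribute flows, estimates each class's prior and per-attribute Gaussian parameters, and classifies by comparing the (un-normalised) numerators of equation 3.1:

```python
import math

def gaussian(x, mu, sigma):
    """Gaussian probability density, per the normality assumption."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def nb_train(data):  # data: list of (attribute_vector, class) pairs
    groups = {}
    for x, c in data:
        groups.setdefault(c, []).append(x)
    model = {}
    for c, xs in groups.items():
        stats = []
        for col in zip(*xs):  # per-attribute mean and (floored) std deviation
            mu = sum(col) / len(col)
            sigma = max(1e-6, (sum((v - mu) ** 2 for v in col) / len(col)) ** 0.5)
            stats.append((mu, sigma))
        model[c] = (len(xs) / len(data), stats)  # (prior P(C=c), attribute stats)
    return model

def nb_classify(model, x):
    def score(c):  # numerator of eq. 3.1, using the product from eq. 3.2
        prior, stats = model[c]
        return prior * math.prod(gaussian(v, mu, s) for v, (mu, s) in zip(x, stats))
    return max(model, key=score)

data = [([100.0], "game"), ([120.0], "game"), ([1400.0], "bulk"), ([1300.0], "bulk")]
m = nb_train(data)
print(nb_classify(m, [110.0]))  # game
```

The denominator P(X = x) is omitted, since it is a common normalising factor across classes.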
The C4.5 Decision Tree algorithm
C4.5 is one of the most commonly used algorithms that deploy decision trees for classification.
It has a history dating back to the 1960s and the work of Hunt et al. [116]. The attractiveness
of this algorithm is that, in contrast to the Naive Bayes algorithm, it produces rules that can be
easily understood by humans. In addition, it requires no a priori assumptions about the nature
of the data.
The classification model takes the form of a tree structure, where each node is either a leaf
node, representing a class, or a test node, specifying a test to be carried out on a single attribute
value with two or more outcomes (branches), each linked to a sub-tree.
An instance can be classified by starting at the root of the tree and following the path until it
reaches a leaf node, which provides the classification of the example.
To construct the tree, C4.5 (similarly to other decision tree classifiers) uses a method known
as ‘divide and conquer’ that employs a top-down, greedy search through the space of possible
decision trees from a set of training instances. For optimal tree construction C4.5 selects the
attribute to test at each test node in the tree that maximises a heuristic splitting criterion. One
criterion used in the algorithm is information gain measurement. A detailed description for
calculating this factor is described in [114].
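The information gain criterion can be illustrated with a small worked example (the class labels and the candidate split are invented):

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((labels.count(c) / n) * math.log2(labels.count(c) / n)
                for c in set(labels))

def info_gain(parent, children):
    """Entropy before the split, minus the weighted entropy after it."""
    n = len(parent)
    return entropy(parent) - sum(len(ch) / n * entropy(ch) for ch in children)

parent = ["game"] * 4 + ["other"] * 4
split = [["game"] * 4, ["other"] * 4]  # a perfect split on some attribute
print(info_gain(parent, split))  # 1.0
```

A split that perfectly separates the classes yields the maximum gain (here one bit); C4.5 greedily picks the attribute test with the highest such value at each node.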
The divide and conquer algorithm partitions the data until every leaf contains cases of a
single class, or until further partitioning is impossible because two cases have the same values
for each attribute but belong to different classes [114]. While this is sometimes a reasonable
strategy, it can lead to a loss of predictive accuracy in most applications if there is noise 2 in the
training data, or when the number of training examples is too small to produce a representative
sample of the true target application class. In other words, this simple algorithm can produce
trees that over-fit the training dataset. There are two approaches to overcoming the problem.
The first approach tries to stop the growing of the tree before it reaches the point where it
perfectly classifies the training data. The second one allows the tree to over-fit the data, and
then post prunes the tree. The latter seems to be more successful in practice - and is employed
2 Noise can be a random error or variance in a measured variable [117] (in our case, an error in examples' feature values in the training dataset, such as variability in packet inter-arrival times caused by congestion, or changes in MTU sizes caused by alternate paths between sender and receiver). Noise can also be an error in the class that is assigned to an example in the training dataset [109].
by the C4.5 algorithm [118] [114].
Despite the advantages mentioned above, classification based on decision trees has a num-
ber of limitations. It is unstable - small variations in the training data can result in different
attribute selections at each test point within the tree, and large changes in the classification rules
[119] [118] [120].
Furthermore, trees created from numeric datasets can be quite complex since attribute splits
for numeric data are binary. The process of growing a decision tree is also computationally
expensive. At each node, each candidate splitting field must be sorted before its best split can
be found. Pruning algorithms can also be expensive since many candidate sub-trees must be
formed and compared [118] [120].
3.1.4 Clustering
Classification techniques use pre-defined classes of training instances. In contrast, clustering
methods are not provided with this guidance; instead, they discover natural clusters (groups) in
the data using internalised heuristics [121].
Clustering focuses on finding patterns in the input data. It clusters instances with similar
properties (defined by a specific distance measure, such as Euclidean distance) into
groups. The groups that are so identified may be exclusive, so that any instance belongs in only
one group; or they may be overlapping, where one instance may fall into several groups; they
may also be probabilistic, such that an instance belongs to a group with a certain probability.
They may be hierarchical, where there is a division of instances into groups at the top level, and
then each of these groups is refined further - even down to the level of individual instances [109].
There are three basic clustering methods: the classic k-means, incremental clustering, and
the probability-based clustering. The classic k-means forms clusters in numeric domains, par-
titioning instances into disjoint clusters, while incremental clustering generates a hierarchical
grouping of instances. The probability-based methods assign instances to classes probabilisti-
cally, not deterministically [109].
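A minimal sketch of the first of these methods, classic k-means, restricted to one numeric feature for brevity; the deterministic seeding strategy and the data values are illustrative assumptions, not part of any cited algorithm, and k ≥ 2 is assumed.

```python
def kmeans_1d(points, k, iters=20):
    """A minimal one-dimensional k-means: repeatedly assign each point
    to its nearest centre, then move each centre to the mean of its
    members (disjoint clusters over a numeric domain)."""
    pts = sorted(points)
    # deterministic seeding: k evenly spaced points from the sorted data
    centres = [pts[i * (len(pts) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda j: abs(p - centres[j]))
            clusters[nearest].append(p)
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

# two obvious groups of (hypothetical) mean packet lengths
centres, clusters = kmeans_1d([60, 70, 65, 1400, 1500, 1450], k=2)
```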
To be used for classification, intermediate steps are required to label the resulting clusters and to generate rules from the clusters for future classification. Generally, 'labelling' is the process of classifying the members of a dataset using manual (human) inspection or an irrefutable automated process. A common method is to label a cluster according to the member class that contributes the most to the cluster's population. Rules created from the clusters can take the form of a parametric model used to assign a new flow to a cluster. For example, in [122] and [123] the Euclidean
distance between the new flow and the centre of each pre-defined cluster is computed, and the
new flow belongs to the cluster for which the distance is the least.
The Expectation Maximisation (EM) algorithm is one of the probabilistic clustering meth-
ods. It assigns a data point to each cluster with a certain probability. The underlying statistical
model of EM is a finite mixture. A mixture is a set of probability distributions - one for each
cluster - that models the attribute values for members of that cluster. The algorithm starts with
initial guesses for the parameters for each cluster, uses them to calculate the cluster probabil-
ities for each instance, then uses these probabilities to re-estimate the parameters, and repeats
until convergence is attained [109]. EM has been used to cluster IP traffic flows in previous
work, such as [59] and [60]. Since the algorithm is used in this thesis, a brief description of the
algorithm is presented in a subsection below.
EM algorithm
The EM algorithm applies when both the distributions and the parameters that characterise a mixture model are unknown. It adopts a procedure similar to that of the k-means clustering algorithm: it starts with initial guesses for the unknown parameters, uses them to calculate the cluster probabilities for each instance ('expectation'), uses these probabilities to re-estimate the parameters ('maximisation'), and repeats until convergence [109].
Let y be a random vector whose joint density f(y; θ) is indexed by a p-dimensional parameter θ in Θ. If the complete-data vector y is observed, it is of interest to compute the maximum likelihood estimate of θ based on the distribution of y. The log-likelihood function of y,

log L(θ; y) = l(θ; y) = log f(y; θ)   (3.3)

is then required to be maximised. If θ^(0) is an initial value for θ, then on the first iteration it is necessary to compute

Q(θ, θ^(0)) = E_θ^(0) [l(θ; y)]   (3.4)

Q(θ, θ^(0)) is now maximised with respect to θ; that is, θ^(1) is found such that

Q(θ^(1), θ^(0)) ≥ Q(θ, θ^(0))   (3.5)
for all θ in Θ.
Thus the EM algorithm consists of an E-step (Expectation step) followed by an M-step (Maximisation step), defined as follows.

E-step: Compute Q(θ, θ^(t)), where

Q(θ, θ^(t)) = E_θ^(t) [l(θ; y)]   (3.6)

M-step: Find θ^(t+1) such that

Q(θ^(t+1), θ^(t)) ≥ Q(θ, θ^(t))   (3.7)
for all θ in Θ.
The E-step and the M-step are repeated alternately until the increase in the log-likelihood
is less than ε , where ε is a prescribed small quantity (that can be considered negligible). The
EM algorithm guarantees convergence to a local maximum. To obtain the global maximum,
the whole procedure should be repeated several times, with different initial guesses for the
parameter values. The model that provides the highest local maximum should be chosen as the
best [109].
Another issue in this process involves choosing the number of clusters to model. If it is not known in advance, the process models 1, 2, 3, ... clusters in turn, increasing the number until the gain in log-likelihood falls below ε. Some implementations of the EM algorithm (e.g. WEKA [124]) include an option to find the number of clusters automatically: beginning with one cluster, the algorithm continues to add clusters until the estimated log-likelihood no longer increases [125].
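The E-step/M-step loop and the ε-based stopping rule described above can be sketched for a one-dimensional Gaussian mixture as follows. This is a simplified illustration, not the WEKA implementation; it assumes k ≥ 2, a single run from one deterministic initial guess (a full treatment would restart from several guesses, as noted above), and reasonably well-separated data.

```python
import math

def em_gmm_1d(data, k=2, eps=1e-6, max_iter=100):
    """EM for a one-dimensional Gaussian mixture.  E-step: compute each
    instance's cluster-membership probabilities under the current
    parameter guesses.  M-step: re-estimate weights, means and
    variances from those probabilities.  Stop when the log-likelihood
    improves by less than eps.  Assumes k >= 2."""
    n = len(data)
    lo, hi = min(data), max(data)
    mean_all = sum(data) / n
    var_all = sum((x - mean_all) ** 2 for x in data) / n
    # initial guesses: equal weights, means spread over the data range
    means = [lo + (hi - lo) * j / (k - 1) for j in range(k)]
    variances = [var_all] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: responsibilities r[j] = P(cluster j | instance x)
        resp, ll = [], 0.0
        for x in data:
            dens = [w * math.exp(-(x - m) ** 2 / (2 * v))
                    / math.sqrt(2 * math.pi * v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            ll += math.log(total)
            resp.append([d / total for d in dens])
        if ll - prev_ll < eps:
            break
        prev_ll = ll
        # M-step: re-estimate parameters from the responsibilities
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, data)) / nj + 1e-9
    return weights, means, variances

# two well-separated groups of (hypothetical) feature values
w, m, v = em_gmm_1d([1.0, 1.1, 0.9, 5.0, 5.1, 4.9])
```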
3.1.5 Evaluating supervised learning algorithms
A good ML classifier optimises Recall and Precision. However, there may be trade-offs between
these metrics. To decide which one is more important or should be given higher priority one
needs to take into account the cost of making wrong decisions or wrong classifications. The de-
cision must depend on the specific application context and one’s commercial and/or operational
priorities.
3.1. A REVIEW OF CLASSIFICATION WITH MACHINE LEARNING 53
Various tools exist to study this trade-off, and thus support this decision-making process.
The receiver operating characteristic (ROC) curve provides a way to visualise the trade-off between TP and FP by plotting the TP rate as a function of the FP rate (TP expressed as a percentage of all positive instances, and FP as a percentage of all negative instances). This has been found useful in analysing how classifiers perform over a range of threshold settings [109]. Another tool is the Neyman-Pearson criterion [126], which attempts to maximise TP subject to a fixed threshold on FP [127].
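For concreteness, the metrics underlying this trade-off can be computed directly from the four confusion-matrix counts of a two-class problem; the counts below are hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Recall, Precision and the ROC axes from the confusion-matrix
    counts: tp/fp/fn/tn are the True Positive, False Positive, False
    Negative and True Negative counts respectively."""
    recall = tp / (tp + fn)      # also the ROC y-axis (TP rate)
    precision = tp / (tp + fp)
    fp_rate = fp / (fp + tn)     # the ROC x-axis
    return recall, precision, fp_rate

# e.g. 90 game flows found, 10 missed, 30 other flows mislabelled as game
r, p, fpr = classification_metrics(tp=90, fp=30, fn=10, tn=870)
```

Sweeping a classifier's decision threshold and recording (fp_rate, recall) pairs traces out the ROC curve.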
A challenge when using supervised learning algorithms is that both the training and test-
ing phases must be performed using datasets that have been pre-labelled 3. Ideally one would
have a large training set (for optimal learning and creation of models) and a large, yet indepen-
dent, testing dataset to assess the algorithm’s performance. (Testing on the training dataset is
usually misleading. Such testing will usually only show that the constructed model is good at
recognising the instances from which it was constructed.)
In the real world we are often faced with a limited quantity of pre-labelled datasets. A
simple procedure (sometimes known as holdout [109]) involves setting aside some part (e.g.
two thirds) of the pre-labelled dataset for training, and the rest (e.g. one third) for testing.
In practice, when only small or limited datasets are available a variant of holdout, called
N-fold cross-validation, is most commonly used. The dataset is first split into N approximately
equal partitions (or folds). Each partition (1/N) in turn is then used for testing, while the
remainder ((N− 1)/N) are used for training. The procedure is repeated N times so that in the
end, every instance has been used exactly once for testing. The overall Recall and Precision
are calculated from the average (mean value) of the Recalls and Precisions measured from all N
tests. The results therefore do not apply to a particular classifier among those tested, but can be
considered as an estimation for a classifier being trained on the whole dataset [128][129]. It has
been claimed that N = 10 (10-fold cross-validation) provides a good estimate of classification
performance [109].
Simply partitioning the full dataset N ways does not guarantee that each fold preserves the class proportions of the full dataset. A further step, known as stratification, is usually applied: randomly sampling the dataset in such a way that each class is properly represented in both the training and testing datasets.3 When stratification is used in combination with cross-validation, it is called stratified cross-validation. It is common to use stratified 10-fold cross-validation when only limited pre-labelled datasets are available.
3 In contrast to a controlled training and testing environment, operational classifiers do not have access to previously labelled example flows.
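A minimal sketch of stratified fold assignment, dealing the instances of each class round-robin across the folds, under the simplifying assumption that the instances have already been shuffled:

```python
from collections import defaultdict

def stratified_folds(labels, n_folds=10):
    """Stratified N-fold partitioning: the instances of each class are
    dealt round-robin across the folds, so every fold keeps roughly the
    class proportions of the full dataset.  Returns the fold index
    assigned to each instance."""
    fold_of = [None] * len(labels)
    per_class = defaultdict(list)
    for i, y in enumerate(labels):
        per_class[y].append(i)
    for indices in per_class.values():
        for pos, i in enumerate(indices):
            fold_of[i] = pos % n_folds
    return fold_of

# a hypothetical imbalanced dataset: 20 'game' flows, 80 'other' flows
labels = ['game'] * 20 + ['other'] * 80
folds = stratified_folds(labels, n_folds=10)
# each fold receives 2 'game' and 8 'other' instances
```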
3.1.6 Evaluating unsupervised learning algorithms
While Recall and Precision are common metrics to evaluate classification algorithms, evaluating
clustering algorithms is more complicated. There are intermediate steps required in evaluating
the resulting clusters before labelling them or generating rules for future classification. Given a
dataset, a clustering algorithm can always generate a division, with its own finding of structure
within the data. Different approaches can lead to different clusters, and even for the same
algorithm, different parameters or different orders of input patterns might alter the final results
[130] [131].
Therefore, it is important to have effective evaluation standards and criteria to provide the
users with a certain level of confidence in results generated by a particular algorithm, or comparisons of different algorithms [132]. Criteria should help answer useful questions such as: how many clusters are hidden in the data; what the optimal number of clusters is [131]; whether the resulting clusters are meaningful or just an artifact of the algorithm [132]; how one algorithm performs compared to another - how easy each is to use and how fast it runs [130]; what the intra-cluster quality is; how good the inter-cluster separation is; what the cost of labelling the clusters is; and what the requirements are in terms of computation and storage.
Halkidi et al.[131] identify three approaches to investigating cluster validity: external cri-
teria, internal criteria and relative criteria. The first two approaches are based on statistical
hypothesis testing. The external criteria approach is based on some pre-specified structure,
which is known as prior information on the data and is used as a standard to compare and validate the clustering results [132]. The internal criteria approach evaluates the clustering result of an algorithm by examining the internal structure inherent in the dataset. The relative criteria approach emphasises finding the best clustering scheme that a clustering algorithm
can define under certain assumptions and parameters. The basic idea is to evaluate a clustering
structure by comparing it to others that use the same algorithm but with different parameter
values [133]. (More details on these approaches can be found in [131], [132], [130] and [109].)
3.1.7 Feature selection algorithms
A key to building an ML classifier is identification of the smallest necessary set of features
required to achieve one’s goals in relation to accuracy - a process known as feature selection.
The quality of the feature set is crucial to the performance of an ML algorithm. Using
irrelevant or redundant features often leads to negative impacts on the accuracy of most ML
algorithms. It can also make the system more computationally expensive, as the amount of
information stored and processed rises with the dimensionality of a feature set used to describe
the data instances. Consequently it is desirable to select a subset of features that is small in size
yet retains essential and useful information about the classes of interest.
Feature selection algorithms can be broadly classified into filter methods or wrapper meth-
ods. Filter method algorithms make independent assessments based on the general characteris-
tics of the data. They rely on a certain metric to rate and select the best subset before learning
commences. The results provided therefore should not be biased toward a particular ML al-
gorithm. Wrapper method algorithms, on the other hand, evaluate the performance of different
subsets using the ML algorithm that will ultimately be employed for learning. Their results are
therefore biased toward the ML algorithm used. A number of subset search techniques can be
used, such as Correlation-based Feature Selection (CFS) filter techniques with a Greedy, Best-
First or Genetic search. (Additional details on these techniques can be found in [109], [134],
[135], [136] and [137].)
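As a simple illustration of the filter approach, features can be rated independently of any learning algorithm. The score used here is a plain absolute correlation between a feature and a numerically encoded class label - a generic filter metric, not the CFS merit formula - and the feature names and values are hypothetical.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def filter_rank(features, labels):
    """Filter-method feature ranking: rate each feature by the absolute
    correlation between its values and the class labels, before any
    learning commences."""
    scores = {name: abs(pearson_r(vals, labels))
              for name, vals in features.items()}
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical per-flow features; class encoded as 1 = game, 0 = other
features = {
    'mean_pkt_len': [60, 70, 65, 1200, 1400, 1300],
    'mean_iat_ms':  [20, 22, 21, 21, 20, 22],   # uninformative
}
labels = [1, 1, 1, 0, 0, 0]
ranking = filter_rank(features, labels)
```

Keeping only the top-ranked features yields the small, informative subset that the text argues for.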
3.1.8 Imbalanced datasets problem
A common assumption for machine learning classification is that the participating classes share similar prior probabilities - that is, they possess similar percentages of examples in the dataset. However,
this assumption is normally violated in real-world problems, for example, in network intrusion
detection, and fraud and anomaly detection. It is often the case that the ratios of prior probabili-
ties between classes are significantly skewed. For example, there may be a ‘majority class’ that
greatly outweighs a ‘minority class’ in terms of number of examples. This problem is referred
to as inter-class imbalance.
Another problem of imbalance is within a single class, normally referred to as intra-class
imbalance. This occurs when the members of a class are under-represented compared to other
members of the same class drawn from different distributions.
These imbalances may have a negative impact on the performance of standard classification
algorithms, such as the C4.5 Decision Tree algorithm, which normally aim to maximise the
overall classification accuracy. When dealing with unbalanced datasets, these algorithms may
result in classifiers that ignore the minority class or classifiers that over-fit the training data
[138][139].
Several methods have been proposed to deal with the problem of inter-class imbalance,
including re-sampling the training datasets (e.g. [140] and [141]), adjusting misclassification
costs (e.g. [142]), and learning from the minority class (e.g. [143]). Among them, re-sampling
appears to be a reasonably effective approach [144].
Re-sampling is the process of changing the prior probabilities of the majority and minority
classes in the training set by changing the number of examples in the majority and minority
classes [144]. In particular, over-sampling duplicates the minority examples in the training set.
While this helps with balancing the dataset, it does not increase the amount of information, and
may lead to over-fitting [139] [138]. In under-sampling, examples are removed from the majority class. While this improves the balance of prior probabilities between the classes, it results in a loss of information that may be useful in building an accurate classification model.
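The two re-sampling strategies can be sketched as follows; the dataset here is a hypothetical 5:95 minority:majority mix, and the `'class'` field name is illustrative.

```python
import random

def undersample(dataset, majority_cls, target, rng):
    """Randomly discard majority-class examples until only `target`
    remain; the balance improves but some information is lost."""
    majority = [ex for ex in dataset if ex['class'] == majority_cls]
    rest = [ex for ex in dataset if ex['class'] != majority_cls]
    return rest + rng.sample(majority, target)

def oversample(dataset, minority_cls, target, rng):
    """Randomly duplicate minority-class examples until the class has
    `target` members; no new information is added."""
    minority = [ex for ex in dataset if ex['class'] == minority_cls]
    extra = [rng.choice(minority) for _ in range(target - len(minority))]
    return dataset + extra

rng = random.Random(7)
data = [{'class': 'game'}] * 5 + [{'class': 'other'}] * 95
down = undersample(data, 'other', target=5, rng=rng)
up = oversample(data, 'game', target=95, rng=rng)
```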
Nickerson and Milios [145] propose a solution that addresses both inter-class and intra-class
imbalance problems. Their approach firstly clusters the minority class and the majority class
separately. Cluster memberships are then examined and re-sampled based on the number of
examples per cluster instead of the number of examples per class.
The authors of [144] point out that past studies have not reached any conclusive results
with regard to whether under-sampling or over-sampling is better at optimising classification performance. Most likely, the conflicting results are due to the combination of specific datasets and
classification algorithms. Yet under-sampling has the advantage of requiring less training time
and physical resources compared to the over-sampling method.
Another notable point is that there is as yet no solution for the case where the training and testing datasets have different balancing characteristics. The training data may be balanced but
the testing may not and vice versa. Studies such as [146] and [147] have shown that a balanced
class distribution is not always the best for learning, and in some cases naturally occurring class
distribution is shown to perform well.
Besides the issue of an imbalanced ratio, a minority class can also create the problem of a
lack of information. The extent to which an algorithm suffers from an imbalanced ratio may
be different from one algorithm to another; however, all such algorithms will suffer from a lack
of examples presented for training [139]. As in the example provided in [139], for a dataset
consisting of 5:95 minority:majority examples the imbalanced ratio is the same as in a dataset
of 50:950. However, in the first case the minority class is poorly represented and thus suffers
more from the lack of information problem than in the second case. Therefore, when the impact
of the imbalance ratio on a learning algorithm is unclear, it is more important to gather as many
examples from the minority class as possible (under-sampling should not be performed on a
minority class).
More information about this issue can be found in [148], [149] and [139].
3.2 The application of ML in IP traffic classification
A number of general ML concepts take a specific meaning when applied to IPTC. For the
purpose of the subsequent discussion I define the following three terms relating to flows:
• Flow or Uni-directional flow: A series of packets that share the same five-tuple: source and destination IP addresses, source and destination ports, and protocol number.
• Bi-directional flow: A bi-directional flow is a pair of uni-directional flows, one in each
direction between the same source and destination IP addresses and ports 4.
• Full-flow: A bi-directional flow captured over its entire lifetime, from the establishment
to the end of the communication connection.
A class usually indicates the IP traffic caused by (or belonging to) an application or group
of applications. Instances are usually multiple packets belonging to the same flow. Features are
typically numerical attributes calculated over multiple packets belonging to individual flows.
4 In asymmetric routing, server-to-client packets may take a different path to client-to-server packets - the traffic capture point needs to be located where it can see packets in both directions for bi-directional flows to be used.
Examples include mean packet lengths, standard deviation of inter-packet arrival times, total
flow lengths (in bytes and/or packets), Fourier transform of packet inter-arrival time, and so on
[150]. As previously noted not all features are equally useful, so practical ML classifiers choose
the smallest set of features that lead to efficient differentiation between members of a class and
other traffic outside the class.
Internet applications' traffic is often bi-directional. For example, flows consist of data and acknowledgements, requests and replies, or commands and feedback, with each half of the exchange travelling in the opposite direction. Hence bi-directional flows are often chosen for study in the literature
(e.g. [98], [59], [122], [60] and [151]). Each bi-directional flow instance is normally charac-
terised by the values of its features calculated separately in the client-to-server (forward) and
the server-to-client (backward) directions.
The definition of a full-flow is illustrated in Figure 3.2.
Figure 3.2: An illustration of a full-flow. The forward direction is normally defined as the client-to-server direction
Figure 3.3 presents a visual illustration of how the features are calculated for full-flow in-
stances.
Figure 3.3: An illustration of the definition of flow direction and features calculation
Let L_F1, L_F2, ..., L_FJ be the IP packet lengths of packets 1, 2, ..., J in the forward direction, and let L_B1, L_B2, ..., L_BK be the IP packet lengths of packets 1, 2, ..., K in the backward direction. Packet length features for the forward direction are then calculated from the statistics of {L_F1, L_F2, ..., L_FJ}, and features for the backward direction from the statistics of {L_B1, L_B2, ..., L_BK}. Similarly, packet inter-arrival time (IAT) features in the forward and backward directions are calculated from the statistics of {IAT_F1, IAT_F2, ..., IAT_FJ} and {IAT_B1, IAT_B2, ..., IAT_BK} respectively.
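A sketch of this per-direction feature calculation; the packet timestamps, lengths and feature names below are hypothetical illustrations.

```python
import statistics

def direction_features(timestamps, lengths):
    """Features for one direction of a bi-directional flow: statistics
    over the packet lengths, and over the inter-arrival times derived
    from consecutive packet timestamps, as in Figure 3.3."""
    iats = [t2 - t1 for t1, t2 in zip(timestamps, timestamps[1:])]
    feats = {'pkt_len_mean': statistics.mean(lengths),
             'pkt_len_stdev': statistics.pstdev(lengths)}
    if iats:  # a single-packet direction has no IATs
        feats['iat_mean'] = statistics.mean(iats)
        feats['iat_stdev'] = statistics.pstdev(iats)
    return feats

# hypothetical forward-direction packets: (arrival time in s, IP length)
fwd = [(0.00, 60), (0.05, 1500), (0.10, 1500), (0.20, 60)]
fwd_feats = direction_features([t for t, _ in fwd], [l for _, l in fwd])
```

The same function applied to the backward-direction packets yields the second half of the bi-directional feature vector.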
3.2.1 Training and testing a supervised ML traffic classifier
Figure 3.4 presents an example scenario, in which the traffic classifier is intended to recognise
real-time online game traffic (the application class of interest) among the usual mix of traffic
seen on an IP network.
Figure 3.4: A simple scenario of online game traffic classification
Figure 3.5 illustrates the steps involved in building a traffic classifier using a supervised ML
algorithm. As noted earlier, the optimal approach to training a supervised ML algorithm is to
provide previously classified examples of two types of IP traffic: traffic matching the class of
traffic that one wishes later to identify in the network (in this case online game traffic), and
representative traffic of entirely different applications one would expect to see in future (often
referred to as Interfering or Other traffic).
The lower part of Figure 3.5 (Training) expands on the sequence of events involved in train-
ing a supervised ML traffic classifier. First, sample traffic is collected for both the application of
interest (e.g. game traffic) and other interfering applications (such as VoIP, Web, P2P, SSH, and SMTP) that the classifier may see in the network. The ‘features calculation’ step involves calculating the statistical properties of these flows (such as mean packet inter-arrival time, median packet length and/or flow duration) as a prelude to generating features.
Figure 3.5: Training and classification for a two-class supervised ML traffic classifier
An optional next step is ‘data sampling’ or ‘features filtering/selection’, designed to narrow
down the search space for the ML algorithm when faced with extremely large training datasets
(traffic traces). The data sampling step extracts statistics from a subset of instances of various
application classes, and passes these along to the classifier to be used in the training process.
As noted in section 3.1.7, a feature filtering/selection step is desirable to limit the number of
features actually used to train the supervised ML classifier, and thus create an effective classi-
fication model. The input into the ML step is a dataset with training instances for both classes,
presented as a matrix of instances versus features as illustrated in Figure 3.1.
The output of the ML training process is a classification model. It is used in the classification
(sometimes referred to as testing/evaluating) step (illustrated in the upper part of Figure 3.5) to
identify a new unknown flow as either Game or Other traffic. In this classification step, traffic
captured in real-time is used to construct flow statistics from which features are determined and
then submitted to the classification model. (Here we presume that the set of features calculated
from captured traffic is the same as the optimal feature set determined during training.) The
classifier’s output indicates which flows are deemed to be members of the class of interest.
Certain implementations may optionally allow the classification model to be updated in
real-time (performing a similar data sampling and training process). For controlled testing and
evaluation purposes offline traffic traces can be used instead of live traffic capture.
Cross-validation (or stratified cross-validation) may be used to generate accuracy evalua-
tion results during the training/classification steps. However, if the source dataset consists of
IP packets collected at the same time and the same network measurement point, the cross-
validation results are likely to over-estimate the classifier’s accuracy. (Ideally the source data
trace would contain traffic collected at different times and measurement points, using entirely
independent training and testing datasets.)
3.2.2 Supervised versus unsupervised learning
As previously noted, IPTC is usually used to identify traffic belonging to known applications
(classes of interest) within previously unseen streams of IP packets. The key challenge is to
determine the relationship(s) between classes of IP traffic (as differentiated by ML features)
and the applications generating the IP traffic.
Supervised ML schemes require a training phase to cement the link between classes and ap-
plications. Training requires a priori classification (or labelling) of the flows within the training
datasets. For this reason, supervised ML may be attractive for the identification of a particular
(or groups of) application(s) of interest. However, as noted in section 3.1.3, the supervised ML
classifier works best when trained on examples of all the classes it expects to see in practice.
Consequently, its performance may be degraded or skewed if not trained on a representative
mix of traffic or if the network link(s) being monitored start seeing traffic of previously un-
known applications. (For example, Park et al. [152] demonstrated that accuracy is sensitive to
site-dependent training datasets, while Erman et al. [153] revealed different accuracy results
between the two data traces studied for the same ML algorithms.)
When evaluating supervised ML schemes in an operational context it is worthwhile consid-
ering how the classifier will be supplied with adequate supervised training examples, when it
will be necessary to re-train, and how the user will detect new types of applications.
It might appear that one advantage of unsupervised ML schemes is the automatic discov-
ery of classes through the recognition of ‘natural’ patterns (clusters) in the dataset. However,
resulting clusters still need to be labelled (for example, through direct inspection by a human
expert) in order that new instances may be properly mapped to applications. (A related benefit
is that traffic from previously unknown applications may be detected by noting when new clus-
ters emerge - sometimes the emergence of new application flows is noteworthy even before the
identity of the application has been determined.)
Another issue for unsupervised ML schemes is that clusters do not necessarily map 1:1 to
applications. It would be ideal if the number of clusters formed were equal to the number of
application classes to be identified, and each application dominated one cluster group. However,
in practice, the number of clusters is often greater than the number of application classes [60]
One application might spread over and dominate a number of clusters, or conversely an application might spread over several clusters without dominating any of them. Mapping back from a
cluster to a source application can become a great challenge.
When evaluating unsupervised ML schemes in an operational context it is worthwhile con-
sidering how clusters will be labelled (mapped to specific applications), how labels will be up-
dated as new applications are detected, and the optimal number of clusters (balancing accuracy,
cost of labelling and label look-up, and computational complexity).
3.3 Challenges for operational deployment
3.3.1 A deployment scenario
Section 2.2.1 discussed the negative impacts of Last Mile bottlenecks on real-time interactive
traffic. Studies (such as [154], [155], [35], [156] and [157]) have shown that prioritisation
of real-time traffic over non-real-time traffic (such as ‘bursty’ TCP traffic) could improve the
perceived performance of the real-time traffic applications.
With the DOCSIS network considered in section 2.2.1, if the cable modem has the ability
to do class-based queuing and QoS scheduling, it can separate traffic into different queues,
and apply QoS scheduling mechanisms to them. The queuing and scheduling system requires
classification (identification) of traffic. While queuing and scheduling need to be done locally at
the CPE (e.g. embedded at the cable modem), traffic classification may be done at the ISP [35].
Suppose we have a classifier machine that listens to a limited number of packets of a traffic flow, derives their statistical properties, and then recognises the type of application that generates the traffic. Once the flow has been classified, its classification rule can be logged in a database and communicated to the CPE. As a result, its subsequent packets can quickly be mapped to a QoS class, put into a priority queue, or given special network monitoring and treatment when traversing the network. This deployment scenario is illustrated in Figure 3.6.
Figure 3.7 illustrates an example of the operation of a classifier in a QoS-enabled architecture.
Data traffic passing a sniffer point is divided into separate flows (based on the five-tuple packet
header information: source and destination IP addresses, source and destination ports, and pro-
tocol) 5. These flows are passed through the classifier for identification. If a flow is classified as
5 This five-tuple information serves purely to differentiate flows. The numerical values or semantics therein
Figure 3.6: Example of an automated QoS and priority control
one requiring prioritisation, the classifier signals the CPE with this information; the CPE will
use this to apply priority queuing and scheduling for the flow.
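The flow-table mechanics described above - keying packets on the five-tuple and buffering only a small classification window of packets per flow - might be sketched as follows; the field names, window size and addresses are illustrative assumptions.

```python
from collections import defaultdict

def five_tuple(pkt):
    """The flow key: source/destination IP addresses and ports plus
    protocol number.  The values act purely as an opaque identifier."""
    return (pkt['src_ip'], pkt['dst_ip'],
            pkt['src_port'], pkt['dst_port'], pkt['proto'])

def split_into_flows(packets, window=10):
    """Group packets into flows by five-tuple, buffering only the first
    `window` packets of each flow - the classifier's decision window.
    Later packets of an already-classified flow need only a rule-table
    lookup on the same key."""
    flows = defaultdict(list)
    for pkt in packets:
        key = five_tuple(pkt)
        if len(flows[key]) < window:
            flows[key].append(pkt)
    return flows

pkts = [
    {'src_ip': '10.0.0.1', 'dst_ip': '203.0.113.9',
     'src_port': 5000, 'dst_port': 27015, 'proto': 17, 'len': 60},
    {'src_ip': '10.0.0.1', 'dst_ip': '203.0.113.9',
     'src_port': 5000, 'dst_port': 27015, 'proto': 17, 'len': 90},
    {'src_ip': '10.0.0.2', 'dst_ip': '198.51.100.7',
     'src_port': 44000, 'dst_port': 80, 'proto': 6, 'len': 1500},
]
flows = split_into_flows(pkts)   # two distinct flows
```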
The actual number of QoS classes and associated prioritisation levels used in such a sce-
nario will vary depending on customer requirements and ISP capabilities. Diffserv-style QoS (a
small handful of classes, or even only two classes [158]) is often considered sufficient, so long
as no individual QoS class is overloaded. In principle one might imagine hundreds or thou-
sands of different applications being mapped into a small number of QoS classes. In practical
consumer situations there are likely to be only a small number of applications (such as online
games or VoIP) that require prioritisation (with the default being that unrecognised flows are
not prioritised).
A further consideration involves applications whose QoS requirements and statistical traffic
properties vary over time. For example, game traffic features may vary during different phases
of the game. It is ultimately a business decision whether each phase is mapped to a different QoS
class, or always mapped to the high priority class. As noted in section 2.2.3, an ISP will aim for
the simplest technical solution that satisfies their customer’s goals. My focus is to identify an
IPTC technique capable of flexibly supporting the mapping of application traffic (across all or
parts of one or more application flows' lifetimes) to QoS classes. Operational challenges for
such classification are addressed in the following sub-section.
are not important to our ML-based classifier; as long as the combination of the actual binary bits for IP addresses, port numbers and protocol makes a unique identification of a flow, it can be used later to identify the flow's subsequent packets for prioritisation. In rare cases where the IP/TCP headers are encrypted, as long as this combination stays constant for a period of time, it can be used as a unique key to distinguish a flow.
3.3. CHALLENGES FOR OPERATIONAL DEPLOYMENT 65
Figure 3.7: Example operation of an IP flows classifier. [Figure: packets arriving at a sniffer, sequenced by arrival time, are separated into flows by five-tuple (source and destination IPs and ports, and protocol), passed through feature computation and the ML classifier model over a classifier window (e.g. a packet buffer), and the resulting labels (e.g. Game, VoIP, P2P) are recorded in a flow classification rule table.]
66 CHAPTER 3. A BRIEF BACKGROUND ON MACHINE LEARNING
3.3.2 The operational challenges
Section 3.3.1’s scenario and the discussion in section 2.2.3 raise a number of key requirements
for a practical, deployable IP traffic classifier in an operational network. These requirements
may be summarised into five broad categories: accuracy; timely and continuous classification;
directional neutrality; efficient use of memory and processors; and portability and robustness.
They are described in turn below.
Accuracy
Accuracy is a critical requirement that can be measured in terms of Recall and Precision (section 2.3),
both of which are important. A classifier cannot be accepted, either by ISPs or consumers, if
it has low Recall (a high False Negatives Percentage) or low Precision (a high False Positives
Percentage). For example, if the class X application of interest is real-time and interactive, and
is desired to receive priority treatment when traversing the network, a low value of Precision
might not only seriously interfere with the QoS mapping, queuing and scheduling system, but
would also be unacceptable to the Internet customers who are charged for the priority traffic
they inject into the network. On the other hand, a low Recall rate would make the ISP fail to
meet the QoS level guaranteed to the customer.
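As an illustration of how these two metrics behave, Recall and Precision can be computed directly from true/false positive and negative counts (the flow counts below are invented for illustration):

```python
def recall_precision(tp, fp, fn):
    """Recall = TP/(TP+FN); Precision = TP/(TP+FP).

    A high False Negative percentage lowers Recall; a high
    False Positive percentage lowers Precision.
    """
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Example: 90 game flows correctly prioritised, 10 missed,
# 30 non-game flows wrongly given priority treatment.
r, p = recall_precision(tp=90, fp=30, fn=10)
# r == 0.9 (the ISP misses 10% of the guaranteed flows),
# p == 0.75 (25% of prioritised traffic is billed incorrectly).
```

Here a low Precision directly translates into customers being charged for priority treatment of traffic they did not send, while a low Recall means the guaranteed QoS level is not delivered.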
Timely and continuous classification
A timely classifier should reach its decision using as few packets as possible from each flow
rather than waiting until each flow completes before finalising a decision. Reducing the number
of packets required for classification also reduces the memory required to buffer packets during
feature calculations. This is an important consideration for situations where the classifier is
calculating features for (tens of) thousands of concurrent flows. Depending on the business reason
for performing classification, it may be unacceptable to sample the available flows in order to
reduce memory consumption. Instead, one must aim to use fewer packets from each flow.
However, it is not sufficient to classify based only on the first few packets of a flow. For
example, malicious attacks might disguise themselves with the statistical properties of a trusted
application early in their flow’s lifetime. Or the classifier itself might have been started (or
restarted) while hundreds or thousands of flows were already active through a network monitoring
point (thereby missing the starts of these active flows). Consequently, the classifier should
ideally perform continuous classification – recomputing its classification decision throughout
the lifetime of every flow.
Timely and continuous ML classification must also address the fact that many applications
change their statistical properties over time, yet a flow should ideally be correctly classified as
being the same application throughout the flow’s lifetime.
Directional neutrality
Application flows are often assumed to be bi-directional, and the application’s statistical fea-
tures are calculated separately in the forward and reverse directions. Many applications (such
as multiplayer online games or streaming media) exhibit different (asymmetric) statistical prop-
erties in the client-to-server and server-to-client directions. Consequently, the classifier must
either ‘know’ the direction of a previously unseen flow (for example, at which ends the server
and the client are located) or be trained to recognise an application of interest without relying
on external indications of directionality.
Inferring the server and client ends of a flow is fraught with practical difficulties. As a real
world classifier should not presume that it has seen the first packet of every flow currently being
evaluated, it cannot be sure whether the first packet it sees (of any new bi-directional flow of
packets) is heading in the ‘forward’ or ‘reverse’ direction. Furthermore, as noted in section 2.4,
the semantics of the TCP or UDP port fields should be considered unreliable, so it becomes
difficult to justify using ‘well known’ server-side port numbers to infer a flow’s direction.
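At the flow-tracking level, one way to avoid guessing a flow's direction is to canonicalise the five-tuple so that packets travelling in either direction map to the same flow record. The following sketch is illustrative only (the addresses and ports are invented):

```python
def flow_key(src_ip, src_port, dst_ip, dst_port, proto):
    """Direction-neutral flow key (illustrative sketch): sort the two
    endpoints so packets from both directions of a bi-directional flow
    map to the same key, without inferring which end is the server."""
    a, b = (src_ip, src_port), (dst_ip, dst_port)
    return (min(a, b), max(a, b), proto)

# Both directions of the same flow yield the same key:
k1 = flow_key('10.0.0.1', 40000, '192.0.2.9', 27015, 'UDP')
k2 = flow_key('192.0.2.9', 27015, '10.0.0.1', 40000, 'UDP')
assert k1 == k2
```

This solves only the bookkeeping side of the problem; the statistical features themselves must still be either symmetric or learnt without relying on external indications of directionality, as argued above.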
Efficient use of memory and processors
Another important criterion for operational deployment is the classification system's use of
computational resources (such as CPU time and memory consumption). The classifier's efficiency
impacts on the financial cost of building, purchasing and operating large-scale traffic classifi-
cation systems. An inefficient classifier may be inappropriate for operational use regardless of
how quickly it can be trained or how accurately it identifies flows.
Minimising CPU cycles and memory consumption is advantageous whether the classifier is
expected to sit in the middle of an ISP network (where a small number of large, powerful devices
may see hundreds of thousands of concurrent flows at multi-gigabit rates) or out toward the
edges (where the traffic load is substantially smaller, but the CPU power and memory resources
of individual devices are also diminished).
Portability and Robustness
A model may be considered portable if it can be used in a variety of network locations, and
robust if it provides consistent accuracy in the face of network layer perturbations such as
packet loss, traffic shaping, packet fragmentation, and jitter. A classifier is also robust if it can
efficiently identify the emergence of new traffic applications.
3.4 Conclusion
In this chapter I have provided background information about ML and how it could be applied
to IPTC. More information on ML algorithms can be found in [109], [121].
I have also addressed the crucial requirements for a practical and deployable real-time IP
traffic classifier, namely Accuracy, Timely and continuous classification, Directional neutrality,
Efficient use of memory and processors, Portability and Robustness. These critical factors not
only emphasise the technical viability of a solution (by meeting the Accuracy, Timely and
continuous classification and Directional neutrality requirements), but also address the
requirements for an economically feasible and deployable solution (by meeting the Efficient use
of memory and processors, Portability and Robustness requirements). The importance of these
requirements is justified in section 2.2 when considering the context of IPTC and its important
role as the core of most QoS solutions.
With a primary focus on the accuracy of ML-based traffic classifiers, most published re-
search to date 6 has not considered the constraints on classifiers deployed in real-time and
operational networks. My approaches proposed in Chapters 5 and 7 address these vital re-
quirements. I consider not only the real-time requirements of an ML traffic classifier, but also
its sustainable performance when monitoring traffic flows over their lifetime with limited phys-
ical resources. This is what makes my contribution novel and significant.
6Prior to the publications of my proposals in late 2006 [159] [160].
In the next chapter, I review state-of-the-art IPTC approaches using ML techniques. A qual-
itative critique of the reviewed works is then presented, which leads to the problem statement
of my research.
Chapter 4
IP Traffic Classification Using Machine Learning
4.1 Introduction
In Chapter 2 I have shown that ML has potential for solving difficult IP network problems. I
have also provided some background in ML and discussed the application of ML algorithms to
IP traffic classification in Chapter 3. In this chapter I review the previous literature on applying
ML to IPTC, which can be divided into four broad categories:
• Clustering Approaches: Works whose main approach centres around unsupervised learn-
ing techniques.
• Supervised Learning Approaches: Works whose main approach centres around super-
vised learning techniques.
• Hybrid Approaches: Works whose approaches combine supervised and unsupervised
learning techniques.
• Comparisons and Related Work: Works that compare and contrast different ML algo-
rithms, or consider non-ML approaches that could be utilised in conjunction with ML
approaches.
The key points of each reviewed work are discussed in the following subsections and sum-
marised in Tables B.1, B.2, B.3 and B.4 (Appendix B.1).
This chapter demonstrates that most published research has focused primarily on the ac-
curacy of ML-based traffic classifiers, and has not considered the constraints on classifiers
deployed in real-time, operational networks. These studies have typically relied on features
calculated for full-flows consisting of thousands of packets, both for training and for subsequent
classification. The efficacy and timeliness of ML classifiers have not been evaluated
under conditions where the beginning of the flow is missed and the classifier sees only a subset
of its packets.
Yet, as mentioned in 3.3.2, in real IP networks traffic classifiers must reach decisions well
before a flow has finished. The classifier may start (or re-start) at any time, and may not see the
actual beginning of a flow. The application’s statistical behaviour may change over the lifetime
of each flow. In addition there may be thousands of concurrent flows, and the classifier has to
operate with finite CPU and memory resources.
Section 4.6 of this chapter discusses the limitations of the reviewed works with regards to
the operational challenges. This helps to define the problem statement for my thesis, justify its
originality and explain why it is worthwhile pursuing.
4.2 Clustering approaches
4.2.1 Flow clustering using Expectation Maximisation
In 2004 McGregor et al. [59] published one of the earliest works applying ML to IP traffic
classification, using the Expectation Maximisation (EM) algorithm [161]. The approach clusters traffic
with similar observable properties into different application types.
This study examined HTTP, FTP, SMTP, IMAP, NTP and DNS traffic. Packets in a 6-hour
Auckland-VI trace were divided into bi-directional flows. Flow features (listed in Table B.1)
were calculated on a full-flow basis. Flows were not timed out, except when they exceeded the
length of the traffic trace.
Based on these features, the EM algorithm was used to group the traffic flows into a small
number of clusters, and classification rules were then created from these clusters. From these
rules, features that did not have a large impact on the classification were identified and removed
from the input to the learning machine and the process was repeated. The implementation of
EM in this study included an option to allow the number of clusters to be found automatically
via cross-validation. The resulting estimation of performance was then used to select the best
competing model (hence the number of clusters).
In this study, the algorithm was found to separate traffic into a number of classes based
on traffic type (such as bulk transfer, small transactions, or multiple transactions). However,
the results were limited in identifying individual applications of interest. Nonetheless, it
may be suitable to apply this approach as the first step of classification in cases where the traffic
is completely unknown, as it could possibly provide an indication of the group of applications
that have similar traffic characteristics. Importantly, the use of features that are calculated on
the basis of the completion of traffic flows hinders the application of the approach for real-time
IP traffic classification in an operational network.
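For readers unfamiliar with EM, the following toy sketch fits a two-component, one-dimensional Gaussian mixture to a single flow feature (e.g. packet length). It is a generic illustration of the algorithm, not McGregor et al.'s implementation, and the data values are synthetic:

```python
import math

def em_1d(xs, iters=50):
    """Fit a two-component 1-D Gaussian mixture by EM (toy sketch)."""
    mu = [min(xs), max(xs)]       # crude initialisation
    var = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [w[k] / math.sqrt(2 * math.pi * var[k])
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 for k in (0, 1)]
            s = p[0] + p[1]
            resp.append((p[0] / s, p[1] / s))
        # M-step: re-estimate weights, means and variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp)
            w[k] = nk / len(xs)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, xs)) / nk + 1e-6
    return mu, var, w

# Synthetic packet lengths from two 'applications':
xs = [60, 62, 58, 61, 59, 1400, 1395, 1410, 1405, 1390]
mu, var, w = em_1d(xs)
# The two component means converge near 60 and 1400.
```

The repeated E/M iterations converge to a local maximum of the likelihood, which is why AutoClass (next subsection) restarts EM from multiple pseudo-random points.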
4.2.2 Automated application identification using AutoClass
The work of Zander et al. [60], proposed in 2005, uses AutoClass [162], an unsupervised
Bayesian classifier that uses the EM algorithm to determine the best set of clusters from the
training data. EM is guaranteed to converge to a local maximum. To find the global maximum,
AutoClass repeats EM searches starting from pseudo-random points in the parameter space. The
model with the parameter set that has the highest probability is considered the best.
AutoClass can be preconfigured with the number of classes (if known) or it can try to
estimate the number of classes itself. First, packets are classified into bi-directional flows and
flow characteristics are computed using NetMate [163]. A number of features are calculated for
each flow, in each direction (listed in Table B.1). Feature values are calculated on a full-flow
basis. A flow timeout of 60 seconds is used.
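The flow construction step described above can be sketched as follows. This is an illustrative sketch only; NetMate's actual flow logic is more elaborate, and the five-tuples here are assumed to be already canonicalised so that both directions map to one key:

```python
def assemble_flows(packets, timeout=60.0):
    """Group (timestamp, five_tuple) packets into flows, starting a new
    flow whenever a five-tuple has been idle for longer than `timeout`
    seconds (a sketch of the flow definition described above)."""
    last_seen = {}   # five_tuple -> timestamp of its previous packet
    flow_id = {}     # five_tuple -> current flow number
    flows = {}       # flow number -> list of packet timestamps
    next_id = 0
    for ts, key in packets:           # packets assumed time-ordered
        if key not in last_seen or ts - last_seen[key] > timeout:
            flow_id[key] = next_id    # idle too long: open a new flow
            next_id += 1
        last_seen[key] = ts
        flows.setdefault(flow_id[key], []).append(ts)
    return flows

# A 90-second gap splits five-tuple 'k' into two flows:
flows = assemble_flows([(0.0, 'k'), (5.0, 'j'), (10.0, 'k'), (100.0, 'k')])
```

Feature vectors (per flow, per direction) would then be computed over each of the resulting packet groups.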
Sampling is used to select a subset of the flow data for the learning process. Once the
classes (clusters) have been learnt, new flows are classified. The results of the learning and
classification are exported for evaluation. The approach is evaluated based on random samples
of flows obtained from three 24-hour traffic traces (Auckland-VI, NZIX-II and Leipzig-II traces
from NLANR [164]).
Taking a further step from [59], this study proposed a method for cluster evaluation. A met-
ric called intra-class homogeneity, H, is introduced to assess the quality of the resulting classes
and classification. H of a class is defined as the largest fraction of flows of one application in
the class. The overall homogeneity H of a set of classes is the mean of the class homogeneities.
The goal is to maximise H to achieve a good separation between different applications.
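The metric can be sketched directly from its definition (the cluster contents below are invented for illustration):

```python
from collections import Counter

def class_homogeneity(cluster_labels):
    """H of a class: the largest fraction of flows of one application."""
    counts = Counter(cluster_labels)
    return max(counts.values()) / len(cluster_labels)

def overall_homogeneity(clusters):
    """Overall H: the mean of the per-class homogeneities."""
    hs = [class_homogeneity(c) for c in clusters]
    return sum(hs) / len(hs)

# Two clusters: one pure Half-Life, one mixed HTTP/FTP (invented data).
clusters = [['halflife'] * 10,
            ['http'] * 6 + ['ftp'] * 4]
# Per-class H values are 1.0 and 0.6, so overall H = 0.8.
```

A perfectly separated clustering has H = 1; mixed clusters pull H down.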
The results of this study revealed that some separation between the different applications
could be achieved, especially for certain particular applications (such as Half-Life online game
traffic) in comparison with others. With different sets of features used, the authors demonstrated
that H increased with an increase in the number of features used. H reached a maximum value
of between 85% and 89%, depending on the trace. However, their work did not address
the trade-off between the number of features used and the consequent computational
overhead.
To compute the accuracy for each application the authors mapped each class to the applica-
tion that was dominating that class (by having the largest fraction of flows in that class). The
authors used accuracy (Recall) as an evaluation metric. Median accuracy was ≥ 80% for all ap-
plications across all traces. However, there were some exceptional cases. For example, for the
Napster application there was one trace where it was not dominating any of the classes (hence
the accuracy is 0%). The results also indicated that FTP, HTTP and Telnet seemed to have the
most diverse traffic characteristics and were spread across many classes.
In general, although the mapping of class to application shows promising results in separat-
ing the different applications, the number of classes resulting from the clustering algorithm is
high (approximately 50 classes for 8 selected applications). For class and application mapping,
it is a challenge to identify applications that do not dominate any of the classes. The use of fea-
tures that require (or are calculated on the basis of) the completion of traffic flows also hinders
the application of the approach in real-time IPTC in an operational network.
4.2.3 TCP-based application identification using Simple K-Means
In 2006 Bernaille et al. [122] proposed a technique using an unsupervised ML (Simple K-
Means) algorithm that classified different types of TCP-based applications using the first few
packets of the traffic flow.
In contrast to previously published work, the method proposed in this paper allows early
classification of a traffic flow by looking at only the first few packets of a TCP flow. The intuition
behind this method is that the first few packets capture the application’s negotiation phase,
which is usually a pre-defined sequence of messages and is distinct among applications.
The training phase is performed offline. The input is a one-hour packet trace of TCP flows
from a mix of applications. Flows are grouped into clusters based on the values of their first
P packets. Flows are represented by points in a P-dimensional space, where each packet is
associated with a dimension; the coordinate on dimension p is the size of packet p in the flow.
Bi-directional flows are used. Packets sent by the TCP server are distinguished from packets
sent by the TCP client by having a negative coordinate.
Similarity between flows is measured by the Euclidean distance between their associated
spatial representations. After natural clusters are formed, the modelling step defines a rule to
assign a new flow to a cluster. (The number of clusters is chosen by trialling different values
of k for the K-means algorithm.) The classification rule is simple: the Euclidean distance
between the new flow and the centre of each pre-defined cluster is computed, and the new
flow belongs to the cluster for which the distance is the least. The training set also consists of
payload, so that flows in each cluster can be labelled with their source application. The learning
output consists of two sets: one with the description of each cluster (the centre of the cluster),
and the other with the composition of its applications. Both sets are used to classify flows
online.
In the classification phase, packets are formed into a bi-directional flow. The sizes of the
first P packets of the connection are captured and used to map the new flow to a spatial repre-
sentation. After the cluster is defined, the flow is associated with the application that is the most
prevalent in that cluster.
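The nearest-centre classification rule can be sketched as follows. The cluster centres and packet sizes are invented for illustration; as in the paper, packets sent by the server carry a negative coordinate:

```python
import math

def nearest_cluster(flow_sizes, centres):
    """Assign a flow (the sizes of its first P packets, with server-sent
    packets negated) to the cluster whose centre is closest in
    Euclidean distance."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    return min(range(len(centres)), key=lambda k: dist(flow_sizes, centres[k]))

# Toy centres for P = 3 (invented values, not from the paper):
centres = [(64, -1400, 64),    # cluster 0: e.g. a bulk-download start
           (120, -120, 120)]   # cluster 1: e.g. a chatty negotiation
label = nearest_cluster((100, -130, 110), centres)  # assigned to cluster 1
```

The flow is then labelled with whichever application is most prevalent in the chosen cluster, which is exactly where the POP3 failure mode below arises.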
The results reveal that more than 80% of total flows are correctly identified for a number of
applications by using the first five packets of each TCP flow. One exceptional case is the POP3
application. The classifier labels 86% of POP3 flows as NNTP and 12.6% as SMTP, because
POP3 flows always belong to clusters where POP3 is not the dominant application.
The results of this approach are promising for early classification of traffic flows. However,
this approach assumes that the classifier can always capture the start of each flow. The effec-
tiveness of the approach when the classifier misses the first few packets of the traffic flow has
not been discussed or addressed. Furthermore, with the use of an unsupervised algorithm and its
classification technique, the proposal faces the challenge of classifying an application when it
does not dominate any of the clusters found.
4.2.4 Identifying HTTP and P2P traffic in the network core
The work of Erman et al. [123] in early 2007 addressed the challenge of traffic classification at
the core of the network, where the available information about the flows and their contributors
might be limited. This work proposed to classify a flow using only uni-directional flow informa-
tion. The authors note that, while the server-to-client direction of a TCP connection might provide more
useful statistics and better accuracy than the reverse direction, it may not always be feasible to
capture traffic in that direction. The researchers also developed and evaluated an algorithm
that could estimate missing statistics from a uni-directional packet trace.
The proposed approach makes use of clustering machine learning techniques, demon-
strated using the K-Means algorithm. Similar to other clustering approaches, Euclidean
distance is used to measure the similarity between two flow vectors.
Uni-directional traffic flows are described by a full-flow-based features set (listed in Table
B.3). Possible traffic classes include HTTP, P2P and FTP. For the training phase, it is assumed
that labels for all training flows are available (manually classified based on payload content and
protocol signatures), and a cluster is mapped back to a traffic class that makes up the majority of
flows in that cluster. An unseen flow will be mapped to the nearest cluster based on its distance
from the clusters’ centroids.
The approach is evaluated using flow accuracy and byte accuracy as performance metrics.
Three datasets are considered: datasets containing only client-to-server packets, datasets con-
taining only server-to-client packets, and datasets containing a random mixture of each direc-
tion. The K-Means algorithm requires the number of clusters as an input and it has been shown
that both flow and byte accuracies improved as k increased from 25 to 400. Overall, the server-
to-client datasets consistently gave the best accuracy (95% and 79% in terms of flows and bytes
respectively). With the random datasets, the average flow and byte accuracy was 91% and 67%
respectively. For the client-to-server datasets, 94% of the flows and 57% of the bytes were
correctly classified.
The algorithm to estimate the missing flow statistics is based on the syntax and seman-
tics of the TCP protocol, so it works only with TCP, not with other transport protocols. The
flow statistics are divided into three general categories: duration, number of bytes, and number
of packets. The flow duration in the missing direction is estimated as the duration calculated
with the first and the last packet seen in the observed direction. The number of bytes trans-
mitted is estimated according to information contained in acknowledgement (ACKs) packets.
The number of packets sent is estimated with the tracking of the last sequence number and ac-
knowledgement number seen in the flow, with regards to the maximum segment size (MSS). A
number of assumptions are made in this process. For example, a common MSS value of 1460
bytes is assumed, along with a simple acknowledgement strategy of one ACK (a 40-byte header
with no payload) per data packet, and no packet loss or retransmission. In
this study, an evaluation of the estimation algorithm is reported; the results were promising for
flow duration and byte estimation, with a relatively larger error range revealed for the
packet-count estimation.
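The estimation idea can be sketched as follows. This is a simplified illustration built on the assumptions stated above, not the authors' algorithm, and the timestamps and acknowledgement numbers are invented:

```python
import math

def estimate_reverse_stats(first_ts, last_ts, first_ack, last_ack, mss=1460):
    """Estimate the missing direction of a TCP flow from the observed
    direction only, under the stated assumptions (MSS = 1460 bytes,
    one ACK per data packet, no loss or retransmission)."""
    duration = last_ts - first_ts          # same span as the observed side
    data_bytes = last_ack - first_ack      # bytes covered by the ACK numbers
    packets = math.ceil(data_bytes / mss)  # full-MSS data packets implied
    return duration, data_bytes, packets

# ACK numbers seen in the observed direction advance from 1000 to 30200:
d, b, n = estimate_reverse_stats(0.0, 2.5, 1000, 30200)
# 29200 bytes were sent in the unseen direction, i.e. 20 full-MSS packets.
```

Violations of the assumptions (delayed ACKs, loss, sub-MSS segments) bias the packet-count estimate, which is consistent with the larger error range reported for that statistic.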
This work addressed the interesting issue of the possibility of using uni-directional flow
statistics for traffic classification and proposed a method to estimate the missing statistics. A
related issue of directionality in the use of bi-directional traffic flows, based on the material and
discussion in this thesis, was addressed earlier in [160]. The use of features that are calculated on
the basis of the completion of traffic flows hinders the application of the approach in real-time
IP traffic classification in an operational network.
4.3 Supervised learning approaches
4.3.1 Statistical signature-based approach using NN, LDA and QDA algorithms
In 2004 Roughan et al. [61] proposed to use the nearest neighbours (NN), linear discriminant
analysis (LDA) and quadratic discriminant analysis (QDA) ML algorithms to map different
network applications to predetermined QoS traffic classes.
The authors list a number of possible features, and classify them into five categories:
• Packet Level: e.g. packet length (mean and variance, root mean square).
• Flow Level: flow duration, data volume per flow, number of packets per flow (all with
mean and variance values) etc. Uni-directional flow is used.
• Connection Level: e.g. advertised TCP window sizes, throughput distribution and the
symmetry of the connection.
• Intra-flow/connection features: e.g. packet inter-arrival times between packets in flows.
• Multi-flow: e.g. multiple concurrent connections between the same set of end-systems.
Of the features considered, the most valuable pair was the average packet length and flow
duration. These features are computed per full-flow, then per aggregate of flows within 24-hour
periods (an aggregate is a collection of statistics indexed by server port and server IP address).
Three cases of classification are considered. The three-class classification looks at three
types of application: Bulk data (FTP-data), Interactive (Telnet), and Streaming (RealMedia).
The four-class classification looks at four types of applications: Interactive (Telnet), Bulk data
(FTP), Streaming (RealMedia) and Transactional (DNS). The seven-class classification looks
at seven applications: DNS, FTP, HTTPS, Kazaa, RealMedia, Telnet and WWW.
The classification process is evaluated using 10-times cross-validation. The classification
error rates are shown to vary depending on the number of classes the process has sought to
identify. The three-class classification had the lowest error rate, varying from 2.5% to 3.4% for
the different algorithms, while the four-class classification had an error rate in the range of 5.1% to
7.9%, and the seven-class one had the highest error rate, of 9.4% to 12.6%. The use of features
that are calculated on the basis of the completion of traffic flows hinders the application of the
approach in real-time IP traffic classification in an operational network.
4.3.2 Classification using Bayesian analysis techniques
In 2005 Moore and Zuev [98] proposed to apply the supervised ML Naive Bayes technique to
categorise Internet traffic by application. Traffic flows in the dataset used are manually classified
(based upon flow content) allowing accurate evaluation.
In this study, the classifier was trained using 248 full-flow based features (a summary is
listed in Table B.2). Selected traffic for Internet applications was grouped into different cate-
gories for classification, such as bulk data transfer, database, interactive, mail, services, HTTP,
P2P, attack, games and multimedia.
To evaluate the classifier’s performance, the authors used Accuracy and Trust (equivalent
to Recall) as evaluation metrics. The results showed that with the simple Naive Bayes tech-
nique, using the whole population of flow features, approximately 65% flow accuracy could be
achieved in classification. Two refinements for the classifier were performed, with the use of the
Naive Bayes Kernel Estimation (NBKE) and Fast Correlation-Based Filter (FCBF) methods 1.
These refinements helped to reduce the feature space and improved the classifier performance
to a flow accuracy of greater than 95% overall. With the best combination technique, the Trust
value for an individual class of application ranged, for instance, from 98% for HTTP, to 90%
for bulk data transfer, to approximately 44% for services traffic and 55% for P2P.
This research was extended with the application of the Bayesian neural network approach in
[165]. It has been demonstrated that accuracy is further improved when compared to the Naive
Bayes technique. The Bayesian trained neural network approach is able to classify flows with
up to 99% accuracy for data trained and tested on the same day, and 95% accuracy for data
trained and tested eight months apart. This paper also presented a list of features including their
descriptions and ranking in terms of importance.
While achieving very good classification results, similar to that of other studies reviewed
in previous sections, this work made use of full-flow features. The use of features that are
calculated on the basis of the completion of traffic flows hinders the application of the approach
in real-time IP traffic classification in an operational network.
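As a generic illustration of the Naive Bayes technique underlying these studies (not Moore and Zuev's 248-feature classifier), a minimal Gaussian Naive Bayes over two invented flow features might look like this:

```python
import math
from collections import defaultdict

def train_gnb(samples):
    """Fit a per-class prior and per-feature mean/variance
    (Gaussian Naive Bayes sketch on invented flow features)."""
    by_class = defaultdict(list)
    for feats, label in samples:
        by_class[label].append(feats)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        stats = []
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            mu = sum(col) / n
            var = sum((x - mu) ** 2 for x in col) / n + 1e-6
            stats.append((mu, var))
        model[label] = (n / len(samples), stats)
    return model

def predict_gnb(model, feats):
    """Pick the class maximising log prior + sum of per-feature
    Gaussian log-likelihoods (the 'naive' independence assumption)."""
    def log_like(prior, stats):
        ll = math.log(prior)
        for x, (mu, var) in zip(feats, stats):
            ll += -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
        return ll
    return max(model, key=lambda c: log_like(*model[c]))

# Invented (mean packet length, duration) samples for two classes:
samples = [((1400, 300), 'bulk'), ((1350, 280), 'bulk'),
           ((90, 20), 'game'), ((110, 25), 'game')]
model = train_gnb(samples)
```

The NBKE refinement discussed above replaces the per-feature normal distribution with a kernel density estimate, which is what lifted the reported accuracy.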
4.3.3 GA-based classification techniques
In 2006 Park et al. [166] made use of a feature selection technique based on the Genetic Algo-
rithm (GA). Using the same feature set specified in [152] (listed in Table B.2), three classifiers were
tested and compared: the Naive Bayesian classifier with Kernel Estimation (NBKE), Decision
Tree J48 and the Reduced Error Pruning Tree (REPTree) classifier. Their results suggest that
the two decision tree classifiers provide more accurate classification results than the NBKE
classifier. Their work also highlights the impact of using training and testing data from different
measurement points.
Early flow classification is also briefly mentioned. Accuracy as a function of the number of
packets used for classification is presented for J48 and REPTree classifiers. The first 10 packets
used for classification seem to provide the most accurate result. However, the accuracy result
1The NBKE method is a generalisation of Naive Bayes. It addresses the problem of approximating every feature by a normal distribution. Instead of using a normal distribution with parameters estimated from the data, it uses kernel estimation methods. FCBF is a feature selection and redundancy reduction technique. In FCBF, the goodness of a feature is measured by its correlation with the class and with other good features. A feature is considered good if it is highly correlated with the class, yet is not correlated with any other good features [98].
is provided as an overall result. It is not clear how it would be different for different types of
Internet applications. The effectiveness of the approach when the classifier misses the first few
packets of the traffic flow also has not been discussed or addressed.
4.3.4 Simple statistical protocol fingerprint method
Crotti et al. [167] in early 2007 proposed a flow classification mechanism based on three prop-
erties of the captured IP packets: packet length, inter-arrival time and packet arrival order. They
defined a structure called a protocol fingerprint, which expresses the three traffic properties in a
compact way, and used an algorithm based on normalised thresholds for flow classification.
There are two phases in the classification process: training and classifying. In the training
phase, pre-labelled flows from the application to be classified (the training dataset) are analysed
to build the protocol fingerprints. Uni-directional flows are used; a classifier on the path between
the client and the server will see a pair of flows, one in each direction.
At the IP layer, a flow with N packets can be characterised as an ordered sequence of N
pairs Pi = {si, ∆ti}, with 1 ≤ i ≤ N, where si represents the size of Packeti and ∆ti represents
the inter-arrival time between Packeti−1 and Packeti. Given a set of flows generated by the same,
known protocol, captured by a monitoring device, and with L + 1 being the number of packets of the
longest-lived flow, the protocol's fingerprint is generated as a Probability Density Function
vector PDF that consists of L Probability Density Functions PDFi. The ith PDFi is built from all
the ith pairs Pi belonging to those flows that are at least i + 1 packets long [167].
In order to classify an unknown traffic flow given a set of different PDFs, the authors check
whether the behaviour of the flow is statistically compatible with the description given by at
least one of the PDFs, and choose which PDF best describes it. An anomaly score that gives
a value between 0 and 1 is used to indicate how ‘statistically distant’ an unknown flow is from
a given protocol PDF. It shows the correlation between the unknown flow’s ith packet and the
application layer protocol PDFi described by the specific PDF used; the lower the value, the
higher the probability that the flow was generated by that protocol. To avoid the effects of noise
within the training data, a Gaussian filter is applied to each component of the PDF vector.
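A much-simplified sketch of the fingerprint idea, using packet sizes only (the inter-arrival times, the normalised-threshold scoring and the Gaussian smoothing step are omitted, and all values are invented):

```python
from collections import Counter

def build_fingerprint(flows, positions=3, bin_size=100):
    """Per-position packet-size histograms: a crude stand-in for the
    per-position PDFi of the protocol fingerprint described above."""
    fp = []
    for i in range(positions):
        bins = [f[i] // bin_size for f in flows if len(f) > i]
        counts = Counter(bins)
        total = sum(counts.values())
        fp.append({b: c / total for b, c in counts.items()})
    return fp

def anomaly_score(fp, flow, bin_size=100):
    """0 = perfectly typical of the protocol, 1 = never seen: one minus
    the mean per-position probability of the flow's packet sizes."""
    probs = [fp[i].get(flow[i] // bin_size, 0.0)
             for i in range(min(len(fp), len(flow)))]
    return 1.0 - sum(probs) / len(probs)

# Three training flows of one 'protocol' (first three packet sizes each):
fp = build_fingerprint([[60, 1400, 60], [70, 1380, 65], [80, 1420, 55]])
```

A flow matching the learnt pattern scores near 0, while a flow with unseen per-position sizes scores 1, mirroring the lower-is-better semantics of the anomaly score.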
Their results reveal a flow accuracy of more than 91% for classifying three applications –
HTTP, SMTP and POP3 – using the first few packets of each application’s traffic flow.
80 CHAPTER 4. IP TRAFFIC CLASSIFICATION USING MACHINE LEARNING
In a similar way to the work of Bernaille et al. [122] reviewed above, this approach demon-
strates advanced results for timeliness of the classification. However, it has the same limitation
in assuming that the classifier can always capture the start of each flow, and is aware of the loca-
tions of client and server (for constructing the PDF of client-server and server-client directions).
The effectiveness of the approach when the classifier misses the first few packets of the traffic
flow (assumed to carry the protocol fingerprint) has not been addressed.
4.4 Hybrid approaches
Erman et al. [168] in early 2007 proposed a semi-supervised traffic classification approach
which combines unsupervised and supervised methods. Motivations for their proposal are
grounded in two main reasons. Firstly, labelled examples are scarce and difficult to obtain,
while supervised learning methods do not generalise well when trained with few examples
in the dataset. Secondly, new applications may appear over time, and not all of them are
known a priori; traditional supervised methods map unseen flow instances into one of the
known classes, without the ability to detect new types of flows [168].
To overcome the challenges, the proposed classification method consists of two steps. First,
a training dataset consisting of labelled flows combined with unlabelled flows is fed into a
clustering algorithm. Second, the available labelled flows are used to obtain a mapping from
the clusters to the different known classes. This step allows some clusters to remain unmapped. To map a
cluster with labelled flows back to an application type, a probabilistic assignment is used. The
probability is estimated by the maximum likelihood estimate njk / nk, where njk is the number of
flows with label j that were assigned to cluster k, and nk is the total number of labelled flows that
were assigned to cluster k. Clusters without any labelled flows assigned to them are labelled
‘Unknown’ as application type. Finally, a new unseen flow will be assigned to the nearest
cluster with the distance metric chosen in the clustering step.
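The mapping and classification steps can be sketched as follows (illustrative Python, not the authors' implementation; it assumes the clustering step has already produced per-flow cluster assignments and centroids, and all names are mine):

```python
from collections import Counter

def map_clusters_to_labels(assignments, labels):
    """Map each cluster to an application label by maximum likelihood.

    assignments: cluster index for every training flow.
    labels: application label for every flow, or None if unlabelled.
    A cluster's label is the j maximising n_jk / n_k; clusters that
    received no labelled flows are left out and treated as 'Unknown'."""
    counts = {}
    for k, j in zip(assignments, labels):
        if j is not None:
            counts.setdefault(k, Counter())[j] += 1
    return {k: c.most_common(1)[0][0] for k, c in counts.items()}

def classify(flow, centroids, cluster_labels):
    """Assign a new flow to the nearest centroid (squared Euclidean
    distance here, standing in for the metric chosen in clustering)."""
    nearest = min(centroids,
                  key=lambda k: sum((a - b) ** 2 for a, b in zip(flow, centroids[k])))
    return cluster_labels.get(nearest, 'Unknown')
```

For example, a flow landing nearest to a cluster that received no labelled training flows would be reported as 'Unknown', which is exactly how this scheme flags previously unseen applications.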
This newly proposed approach has promising results. Preliminary results have been shown
in [168] with the employment of the K-Means clustering algorithm. In this case, the classifier
was provided with 64,000 unlabelled flows. Once the flows were clustered, a fixed number
of random flows in each cluster were labelled. Results reveal that with two labelled flows per
cluster and K = 400, this approach results in a 94% flow accuracy. The increase in classification
accuracy is marginal when five or more flows are labelled per cluster. Further discussion of
these results can be found in [169].
The proposal is claimed to offer the following advantages: faster training time with a small
number of labelled flows mixed with a large number of unlabelled flows; being able to handle
previously unseen applications and the variation of existing applications’ characteristics; and
the possibility of enhancing the classifier’s performance by adding unlabelled flows for iterative
classifier training [169]. An evaluation of these advantages has not been performed in [169].
Nevertheless, these findings motivate my investigation into using only a small number of la-
belled samples (down-sampling) for clustering in assisting SSP-ACT, as presented in Chapter
6.
4.5 Comparisons and related work
4.5.1 Comparison of different clustering algorithms
In 2006 Erman et al. [153] compared three unsupervised clustering algorithms: K-Means,
Density Based Spatial Clustering of Applications with Noise (DBSCAN) and AutoClass. The
comparison was performed on two empirical data traces: one public trace from the University
of Auckland, and one self-collected trace from the University of Calgary.
The effectiveness of each algorithm is evaluated using overall accuracy and the number of
clusters it produces. Overall accuracy measurement determines how well the clustering algor-
ithm is able to create clusters that contain only a single traffic category. A cluster is labelled
by the traffic class that makes up the majority of its total connections (bi-directional traffic
flows). Any connection that has not been assigned to a cluster is labelled as noise. Then overall
accuracy is determined by the portion of the total TP for all clusters out of the total number
of connections to be classified. In all clustering algorithms, the number of clusters produced
by a clustering algorithm is an important evaluation factor as it affects the performance of the
algorithm in the classification stage.
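Under the definition above, overall accuracy might be computed as follows (a sketch of my own; unassigned connections are treated as noise, and each cluster's majority class supplies its true positives):

```python
from collections import Counter

def overall_accuracy(assignments, true_labels):
    """Overall accuracy of a clustering: label each cluster with the
    majority traffic class of its connections, count those majority
    connections as true positives, and divide the total TP by the
    number of connections. Connections assigned to no cluster
    (cluster index None) count as noise and contribute no TP."""
    clusters = {}
    for k, label in zip(assignments, true_labels):
        if k is not None:
            clusters.setdefault(k, Counter())[label] += 1
    tp = sum(c.most_common(1)[0][1] for c in clusters.values())
    return tp / len(true_labels)
```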
The results of this study revealed that the AutoClass algorithm produced the best overall
accuracy. On average, AutoClass was 92.4% and 88.7% accurate in the Auckland and Calgary
datasets respectively. It produced on average 167 clusters for the Auckland dataset (for less
than 10 groups of applications) and 247 clusters for the Calgary dataset (for four groups of
applications). For K-Means, the number of clusters can be set, and the overall accuracy steadily
improves as the number of clusters (K) increases. When K is around 100, overall accuracy
was 79% and 84% on average for the Auckland and Calgary datasets respectively. Accuracy
is improved only slightly with a greater value of K. The DBSCAN algorithm produces lower
overall accuracy (up to 75.6% for the Auckland and 72% for the Calgary data sets); however, it
places the majority of the connections in a small subset of the clusters. Looking at the accuracy
for particular traffic class categories, the DBSCAN algorithm has the highest precision value
for P2P, POP3 and SMTP (lower than AutoClass for HTTP traffic).
This study only briefly considers the model build time, and does not explore other perfor-
mance evaluation measurements, such as processing speed, CPU and memory usage, or the
timeliness of classification.
4.5.2 Comparison of clustering versus supervised techniques
In 2006 Erman et al. [170] evaluated the effectiveness of the supervised Naive Bayes and
clustering AutoClass algorithms. Three accuracy metrics were used for evaluation: Recall,
Precision and overall accuracy (overall accuracy is defined here as it is in [153], reviewed in the
previous section).
Classification using the supervised Naive Bayes algorithm is straightforward. For classifi-
cation using AutoClass, once AutoClass comes up with the most probable set of clusters from
the training data, the clustering is transformed into a classifier. A cluster is labelled with the
most common traffic category of the flows within it. If two or more categories are tied, then a
label is chosen randomly amongst the tied category labels. A new flow is then classified with
the traffic class label of the cluster to which it is most similar [170].
The evaluation was performed on two 72-hour data traces provided by the University of
Auckland (NLANR). A connection in this instance is defined as a bi-directional flow. The
feature set is shown in Table B.4.
This research indicated that with the dataset used and nine application classes (HTTP, SMTP,
DNS, SOCKS, IRC, FTP control, FTP data, POP3 and LIMEWIRE), AutoClass has an average
overall accuracy of 91.2% whereas the Naive Bayes classifier has an overall accuracy of 82.5%.
According to the authors, AutoClass also performs better in terms of Precision and Recall for
individual traffic classes. On average, for Naive Bayes, both Precision and Recall for six out of
nine classes were above 80%; whereas for AutoClass, all classes have Precision and Recall
values above 80%, six out of the nine classes have average Precision values of above 90%, and
seven have average Recall values of above 90%. However, in terms of the time taken to build
a classification model, AutoClass takes far longer than the Naive Bayes algorithm (2,070 seconds
versus 0.06 seconds for the algorithm implementation, data and equipment used).
The conclusion that the unsupervised AutoClass outperforms the supervised Naive Bayes
in terms of overall accuracy might be counter-intuitive. Furthermore, another issue related to
clustering approaches is the real-time classification speed, as the number of clusters resulting
from the training phase is typically larger than the number of application classes. Neither of
these two issues has been explored further in [170].
4.5.3 Comparison of different supervised ML algorithms
Williams et al. [171] in 2006 provided insights into the performance aspect of ML traffic clas-
sification. Their work compared a number of supervised ML algorithms: Naive Bayes with
Discretisation (NBD), Naive Bayes with Kernel Density Estimation (NBK), C4.5 Decision
Tree, Bayesian Network, and Naive Bayes Tree. The computational performance of these algo-
rithms is evaluated in terms of classification speed (number of classifications per second) and
the time taken to build the associated classification model.
Results have been collected from experiments on three public NLANR traces. The features
used for analysis include the full set of 22 features, and the two best reduced feature sets selected
by correlation-based feature selection (CFS) and consistency-based feature selection (CON)
algorithms. The feature set is shown in Table B.4.
The results indicate that most algorithms achieve high flow accuracy with the full set of 22
features (the NBK algorithm achieves > 80% accuracy and the rest of the algorithms achieve
greater than 95% accuracy). With the reduced sets of eight (CFS) and nine (CON) features, the
results achieved by cross-validation reveal only slight changes in the overall accuracy compared
to the use of the full feature set. The largest reduction in accuracy was 2-2.5% for NBD and
NBK with the use of the CON reduced feature set.
Despite the similarity in classification accuracy, this study found significant differences in
classification computational performance. The C4.5 Decision Tree algorithm was seen as the
fastest algorithm when using any of the feature sets (with a maximum of 54,700 classifications
per second on a 3.4GHz Pentium 4 workstation running SUSE Linux 9.3 with Waikato Envi-
ronment for Knowledge Analysis (WEKA) implementation of ML algorithms). The ranking of
algorithms in descending order in terms of classification speeds is: C4.5 Decision Tree, NBD,
Bayesian Network, Naive Bayes Tree, NBK.
In terms of the required model build time, the Naive Bayes Tree algorithm takes significantly
longer than the other algorithms. The ranking of algorithms in descending order in terms of
required model build time is: Naive Bayes Tree, C4.5 Decision Tree, Bayesian Network, NBD
and NBK. Feature reduction is also shown to greatly improve performance of the algorithms in
terms of model build time and classification speeds for most algorithms.
These findings are in line with my results presented in Chapters 6, 7 and 8: the C4.5
Decision Tree classifier takes longer to build, but is faster than the NBD classifier in terms of
classification speed.
4.5.4 ACAS: Classification using machine learning techniques on application signatures
Haffner et al. [172] in 2005 proposed an approach for the automated construction of application
signatures using machine learning techniques. In contrast to the other works, this work makes
use of the first n Bytes of a data stream as features. Although it shares the same limitation as
other works that require access to packet payloads, I include it in my literature review as it is
also ML-based, and its interesting results may be useful in a composite ML-based approach that
combines different information such as statistical characteristics, contents, and communication
patterns.
Three learning algorithms – Naive Bayes, AdaBoost and Maximum Entropy – have been
investigated for constructing application signatures for a range of network applications:
FTP control, SMTP, POP3, IMAP, HTTPS, HTTP and SSH. A flow instance is characterised
with n Bytes represented in binary value, and ordered by the position of the Byte in the flow
stream. The collection of flow instances with binary features is used as input by the machine
learning algorithms.
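The byte-based binary representation might be sketched as follows (a sparse encoding of my own devising, for illustration only; the original work's exact feature layout may differ):

```python
def byte_features(payload, n=64):
    """Sparse binary features over the first n bytes of a flow's data
    stream, ordered by byte position: feature (i, b) is active when the
    byte at position i has value b. Shorter streams simply activate
    fewer features."""
    return {(i, b) for i, b in enumerate(payload[:n])}
```

Sets of active features like these are what a learner such as Naive Bayes, AdaBoost or Maximum Entropy would consume as input.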
Using the first 64 bytes of each TCP unidirectional flow, the overall error rate is below 0.51%
for all applications considered. Adaboost and Maximum Entropy provide the best results with
more than 99% of all flows classified correctly. Precision is above 99% for all applications
and Recall is above 94% for all applications except SSH (86.6%). (The poor performance on
SSH application was suspected due to the small amount of sample instances in the training
dataset). As with previously reviewed work on early traffic classification, the effectiveness of
this approach when the classifier misses the first few packets of the traffic flow (assumed to
carry the protocol fingerprint) has not been addressed.
4.5.5 BLINC: Multilevel traffic classification in the dark
Karagiannis et al. [53] in 2005 developed an application classification method based on the
behaviours of the source host at the transport layer, divided into three different levels. The social
level captures and analyses the interactions of the examined host with other hosts, in terms of
the numbers of them with which it communicates. The host’s popularity and that of other hosts
in its community’s circle are considered. The role of the host, in acting as a provider or the
consumer of a service, is classified at the functional level. Finally, transport layer information
is used, such as the four-tuple of the traffic (source and destination IP addresses, and source
and destination ports), and flow characteristics such as the transport protocol, and the average
packet size.
A range of application types was studied in this work, including HTTP, P2P, data transfer,
network management traffic, mail, chat, media streaming, and gaming. By analysing the social
activities of the host, the authors concluded that among the host’s communities, neighbouring
IPs may offer the same service (a server farm) if they use the same service port; exact
communities might indicate attacks, while partial communities may signify P2P or gaming applications.
In addition, most IPs acting as clients have a minimum number of destination IPs. Thus, focus-
ing on the identification of that small number of servers can help client identification, leading
to the classification of a large amount of traffic. Classification at the functional level shows that
a host is likely to be providing a service if, over a period of time, it uses a small number of
source ports, normally less than or equal to two across all of its flows. Typical client behaviour is
normally represented when the number of source ports is equal to the number of distinct flows.
The consistency of average packet size per flow across all flows at the application level is suggested to be a good property for identifying certain applications, such as gaming and malware.
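As an illustrative sketch of the functional-level heuristics above (not BLINC's exact rules; the threshold of two ports and the flow-tuple layout are assumptions):

```python
def functional_role(flows_of_host):
    """Guess a host's functional role from its transport-layer flows:
    a host using very few source ports (here <= 2) across its flows is
    likely providing a service, while a host whose source-port count
    equals its number of distinct flows behaves like a client.

    flows_of_host: list of (src_port, dst_ip, dst_port) tuples."""
    src_ports = {f[0] for f in flows_of_host}
    if len(src_ports) <= 2:
        return 'server'
    if len(src_ports) == len(flows_of_host):
        return 'client'
    return 'unknown'
```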
Completeness and accuracy are the two metrics used for the classification approach in this
case. Completeness is defined as the ratio of the number of flows (bytes) classified by BLINC
over the total number of flows (bytes), indicated through payload analysis. The results show
that BLINC can classify 80% to 90% of traffic flows with more than 95% flow accuracy (70%
to 90% for byte accuracy).
BLINC must gather information from several flows for each host before it can decide on the
role of one host. Such a requirement will present challenges to the employment of this method
in real-time operational networks.
4.5.6 Pearson’s Chi-Square test and Naive Bayes classifier
Focusing on the identification of Skype [173] traffic, in late 2007 Bonfiglio et al. [54] pre-
sented two tests: the first test, based on Pearson’s Chi-Square test, detects Skype’s fingerprint
through analysis of the message content randomness introduced by the encryption process; and
the second test, based on the Naive Bayes theorem, detects Skype’s traffic from its statistical
characteristics.
The aim of Pearson’s Chi-Square test in this context is to check if a message under analysis
complies with one of the Skype message formats, and can thus reveal fingerprints. The test is
based on the first few bits, bytes or the content of the whole message, dependent on the different
types of Skype traffic (e.g. Skype flows transported by UDP or TCP). The second test identifies
Skype flows based on message size (the segment size at the transport layer) and the average
packet inter-arrival time (called average-inter packet gap (average-IPG) in this work) features.
For a window of w packets, it characterises the message size distribution for each possible
Codec, using a number of joined Naive Bayes classifiers. The average-IPG is evaluated as 1/w
times the time elapsed between the reception of the first and the wth packet in the window. A
single Naive Bayes classifier is used for the average-IPG.
The combination of the two tests is shown to be effective in detecting Skype voice traf-
fic over UDP or TCP, with almost zero percent of false positives, and a few percent of false
negatives.
The idea of using feature values averaged on a small window (a window size of 30 packets
was chosen) for the Naive Bayes classifiers is similar to the idea of training on sub-flows in
SSP-ACT; however, its mechanism is different: it makes use of completely different feature
sets, with multiple Naive Bayes classifiers employed. (It is also worth noting that [54] was
published well after the basics of SSP-ACT had been published in 2006 [159] [160].)
4.6 Limitations of the reviewed works
This section provides a qualitative look at the extent to which the reviewed works overlap with
the additional constraints and requirements for using ML techniques inside real-time IP traffic
classifiers outlined in section 3.3.2.
Table B.5 (Appendix B.2) provides a qualitative summary of the reviewed works in relation
to the following criteria.
4.6.1 Timely and continuous classification
Most of the reviewed work has evaluated the efficacy of different ML algorithms when applied
to entire datasets of IP traffic, trained and tested over full-flows consisting of thousands of
packets (such as [98], [61], [60], [59], [170], and [171]).
Some studies ([122] and [167]) have explored the performance of ML classifiers that utilise
only the first few packets of a flow, but they cannot cope with missing the flow’s initial packets.
4.6.2 Directional neutrality
The assumption that application flows are bi-directional, and that the application’s direction
may be inferred prior to classification, permeates many of the works published to date ([98]
[59] [122] [60] [151]). Most works have assumed that they will see the first packet of each
bi-directional flow, and that this initial packet travels from a client to a server. Classification models
are often trained using this assumption, and subsequent evaluations have presumed the ML
classifier can calculate features with a correct sense of forward and reverse direction. However,
in a real-world network a classifier can assume nothing about the direction (client to server or
vice versa) of the first packet captured, particularly if it misses a number of packets from the
actual start of a given flow.
4.6.3 Efficient use of memory and processors
There are definite trade-offs between the classification performance of a classifier and the re-
source consumption of the implementation. For example, [98] and [165] reveal excellent poten-
tial for classification accuracy. However, they use a large number of features, many of which are
computationally challenging. The overheads involved with computing complex features (such
as effective bandwidth based upon entropy, or Fourier Transform of the packet inter-arrival
time) must be considered against the potential loss of accuracy if one simply did without those
features.
Williams et al. [171] provide some pertinent warnings about the trade-off between training
time and classification speed. (For example, among five ML algorithms studied, Naive Bayes
with Kernel Estimation took the shortest time to build a classification model, yet performed
slowest in terms of classification speed.)
Techniques for timely and continuous classification have tended to suggest a sliding win-
dow over which features are calculated. Increasing the length of this window ([159], [160] and
[172]) might increase classification accuracy. However, depending on the method of implemen-
tation (whether it includes opportunities for pipelining, step size with which the window slides
across the incoming packet streams, etc.) this may decrease the timeliness with which classifi-
cation decisions are made (and increase the memory required to buffer packets during feature
calculations). Most of the reviewed work has not, to date, closely investigated this issue.
4.6.4 Portability and Robustness
None of the reviewed works has seriously considered or addressed the issue of classification
model portability mentioned in section 3.3.2.
None of the reviewed works has addressed and evaluated their model’s robustness in terms
of classification performance with the introduction of packet loss, packet fragmentation, delay
and jitter. Unsupervised approaches have the potential to detect the emergence of new types of
traffic. However, this capability has not been evaluated in most of the works, and was only
briefly mentioned in [168].
4.7 My research goal
My goal is to identify and demonstrate a real-time, ML-based traffic classification system that
addresses the limitations identified in section 4.6. My primary focus is on the requirements
of timely and continuous classification, directional neutrality and efficient use of physical re-
sources.
4.8 Conclusion
In this chapter I have reviewed the state of the art in ML-based IP traffic classification. As
can be seen from the literature review, ML-based traffic classification provides very promising
results. However, all of the reviewed studies (prior to my published proposals in 2006 [159] and
[160]) have been concerned only with the accuracy of identifying traffic, but have overlooked
real-time operational deployment issues and requirements.
I have analysed the limitations of the reviewed proposals in terms of timely and continuous
classification, directional neutrality, efficient use of memory and processors, and portability and
robustness.
My research goal is to find an effective real-time, ML-based traffic classification system that
meets the requirements of timely and continuous classification, directional neutrality and effi-
cient use of physical resources. This is the motivation underlying my novel approach presented
in Chapters 5 and 7, while the portability and robustness of my proposed solution is evaluated
in Chapter 8.
Chapter 5
Training Using Multiple Sub-Flows to Optimise the Use of Machine Learning Classifiers in Real-World IP Networks
5.1 Introduction
In this chapter I present a novel modification to traditional ML training and classification tech-
niques. My technique optimises the classification of flows within finite periods of time and with
limited physical resources. I propose that realistic ML-based traffic classification tools should:
• Operate the ML classifier using a sliding window over each flow - the classifier can see
(or must use) no more than the N most recent packets of a flow at any given time.
• Train the ML classifier using sets of features calculated from multiple sub-flows - each
sub-flow is a fragment of N consecutive packets taken from different points within the
lifetime of the flow used for training.
N is chosen to reflect memory limitations in the classifier implementation or the upper limit
on the time allowed to classify a flow. Training on multiple sub-flows allows the sliding window
classifier to properly identify an application regardless of where within a flow the classifier
begins capturing packets.
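As a minimal sketch of the training side of this proposal (illustrative Python; the packet representation and choice of offsets are assumptions, not the exact tooling used in my experiments):

```python
def extract_subflows(packets, n, offsets):
    """Cut training sub-flows out of one full flow: each sub-flow is n
    consecutive (bi-directional) packets starting at a given offset M
    from the start of the flow. Offsets are chosen at points where the
    flow's statistics differ (e.g. the start and the middle); sub-flows
    that would run past the end of the flow are skipped."""
    return [packets[m:m + n] for m in offsets if m + n <= len(packets)]
```

Feature values would then be calculated over each extracted sub-flow, and the ML classifier trained on the combined sub-flow instances rather than on full-flow statistics.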
I illustrate my proposal’s benefits by considering an ISP that wishes to automatically and
quickly detect online interactive game traffic mingled in amongst regular consumer IP traffic. I
apply my modifications to the well-known Naive Bayes and C4.5 Decision Tree algorithms and
demonstrate distinct improvements in classification accuracy and timeliness, compared to the
performance of the training approaches used in the literature.
This chapter is organised as follows. Section 5.2 illustrates and justifies my proposed app-
roach. The details of my experimental method are described in section 5.3. I analyse the results
in section 5.4, followed by some discussions and conclusions in sections 5.5 and 5.6.
5.2 My proposal
My goal is to classify traffic based on only the most recent N packets of a flow (for some
small value of N), which I have called the sliding window. This is driven by two primary
considerations. First, an ML classifier is likely to be part of a larger system (for example,
automated QoS control as discussed in section 3.3.1), that must react swiftly once it identifies a
new flow as belonging to a class of interest. Reducing the time taken to detect traffic of interest
implies reducing the number of packets that must pass the monitoring point before classification
can be achieved. Second, re-calculating features over a sliding window of N packets requires us
to buffer the most recent N packets, so that we can remove the effect of the Nth most recent packet
when a new packet arrives in the same flow (hence the term sliding window). Particularly
on high-speed networks, a classifier may be observing (tens of) thousands of concurrent flows;
minimising the number of buffered packets per flow provides a beneficial reduction in physical
memory requirements.
A practical real-time classifier cannot assume it will see the beginning of all flows. For
example, classification may be initiated at a point in time when many thousands of flows are
already in progress. Thus, a classifier should be capable of recognising flows using N packets
starting from anywhere in a flow.
Using a sliding window of N packets does, however, pose potential problems. Application
flow statistics (such as the maximum packet inter-arrival time) over a small sliding window will
differ from those of the statistics over an entire long flow [174]. Application flow statistics also
often change during the lifetime of a flow. For example, the initial handshake of a new SMTP
connection may look quite different to the traffic while transferring the body of each email.
A classifier trained on feature values calculated for entire flows (as done in the majority of
previous research) may not recognise members of the class when presented with feature values
calculated from subsets of an unknown flow.
The preceding considerations give rise to my novel proposal for training ML classifiers.
First, extract two or more sub-flows (of N packets) from every flow that represents the class
of traffic we wish to identify in the future. Each sub-flow should be taken from places in
the original flow that have noticeably different statistical properties (for example, the start and
middle of the flow). Each sub-flow would result in a set of instances with feature values derived
from its N packets. Then train the ML classifier with the combination of these sub-flows rather
than the original full-flows.
To illustrate my proposal the following scenario is constructed: a real-time classifier must
accurately identify Wolfenstein Enemy Territory (ET) [47] traffic mixed amongst other un-
related, interfering traffic flows. ET is a highly interactive online game representative of ap-
plications whose traffic characteristics can change noticeably over the lifetime of each flow.
I compare classification accuracy using full-flows and sub-flows for various values of N, and
show that training on full-flows performs poorly when classifying using a small sliding window.
Poor Precision and Recall are seen even with a large window of 1000 packets. On the other
hand, training with multiple sub-flows allows a small window of N = 25 packets to achieve high
Recall and Precision. Importantly, classification performance can be maintained even when
packets are missed at the beginning of a flow. An evaluation of my proposal with VoIP traffic
is presented in Chapter 8.
5.3 My experimental approach
This section describes in detail my experimental approach, including how to prepare the data
for training, and how to build and test a classification model.
5.3.1 Flows and features
In my experiments, for UDP traffic a flow is considered to have stopped when no more packets
are seen for 60 seconds in both directions. For TCP traffic, a flow is stopped when the connec-
tion is explicitly torn down or no packets are seen for 60 seconds in both directions (whichever
comes first) 1.
1 A TCP flow is known to be explicitly torn down based on TCP header information. In the strict situation where a classifier is not allowed to do packet header inspection at all (even including TCP header information), only
In this chapter, I introduce a new term, sub-flow, which is defined as follows.
Sub-flow: each sub-flow is a fragment of N consecutive packets (bi-directional) taken from
different points within the original application flow’s lifetime. The forward direction of the
sub-flow is defined as it is in the full-flow: in the client-to-server direction.
Referring to the definition of full-flow presented in Figure 3.2, sub-flows are illustrated in
Figure 5.1. Let M (M ≥ 0) be the number of packets offset from the beginning of each full-flow,
and sub-flow SF-M denotes N consecutive packets starting from the Mth packet with regard to
the beginning of the full-flow.
Figure 5.1: An illustration of sub-flow definition (the full-flow, with forward and backward directions, begins at Packet 0, the first packet; sub-flow SF-M begins at Packet M)
I trained and classified using the following features:
• Minimum, maximum, mean and standard deviation of inter-packet arrival time 2 in the
forward and backward directions.
• Minimum, maximum, mean and standard deviation of inter-packet arrival length 3 in the
forward and backward directions.
• Minimum, maximum, mean and standard deviation of IP packet length in the forward and
backward directions.
These features are chosen as they are independent of flow length and packet contents. They
also require low computation overhead. For each packet captured, only its length and arrival
timestamp are needed for feature calculation. The calculation of minimum, maximum, mean
a timeout can be used to determine when the flow is finished.
2 The difference in arrival times of two consecutive packets traversing in the same direction.
3 The difference in lengths of two consecutive packets traversing in the same direction.
and standard deviation is simple and can be done incrementally as a packet arrives 4. This can
help to improve the classifier’s performance in terms of timeliness and processing speed, as well
as efficient use of memory and physical resources.
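The incremental calculation can be sketched with Welford's online algorithm for mean and variance (an illustrative Python sketch of the idea; NETMATE's actual implementation is in C++ and may differ):

```python
class RunningStats:
    """Running minimum, maximum, mean and standard deviation, updated in
    O(1) per packet with Welford's online algorithm, so no per-packet
    values need to be buffered for the feature calculation itself."""

    def __init__(self):
        self.n = 0
        self.min = float('inf')
        self.max = float('-inf')
        self.mean = 0.0
        self._m2 = 0.0                 # running sum of squared deviations

    def add(self, x):
        self.n += 1
        self.min = min(self.min, x)
        self.max = max(self.max, x)
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    def stddev(self):
        # population standard deviation over the packets seen so far
        return (self._m2 / self.n) ** 0.5 if self.n else 0.0
```

One such accumulator per feature (inter-arrival time, inter-packet length difference, and packet length, in each direction) is updated as each packet arrives.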
Features for full-flows and sub-flows are calculated with a modified version of the NETMATE tool 5 [163].
5.3.2 Machine Learning algorithms
The Naive Bayes and C4.5 Decision Tree algorithms were chosen because:
• Both algorithms support supervised training, so that a classifier can be trained to identify
an individual application or a group (class) of applications of interest.
• These are well-known supervised learning algorithms. They also have been used in other
IP traffic classification work, including [98] and [151].
• They have quite different internal training and classification mechanisms. Testing my pro-
posal with both algorithms reveals similar benefits in each case, suggesting the approach
is applicable to more than just one type of ML algorithm.
• The underlying statistical computation is simple, tractable and understandable. Classifi-
cation models can be expressed as decision trees or sets of classification rules.
I used the WEKA (Waikato Environment for Knowledge Analysis) implementation of the
Naive Bayes and C4.5 Decision Tree (J48) algorithms [109] 6.
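Because the underlying statistics are simple, the core of Gaussian Naive Bayes can be sketched directly (a toy Python illustration of the principle, with hypothetical feature values; it is not the WEKA implementation used in my experiments):

```python
import math
from collections import defaultdict

def train_gnb(X, y):
    """Fit per-class prior, means and variances for each feature."""
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)
    model = {}
    for cls, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        varis = [max(sum((v - m) ** 2 for v in col) / n, 1e-9)
                 for col, m in zip(zip(*rows), means)]
        model[cls] = (n / len(X), means, varis)
    return model

def classify_gnb(model, x):
    """Pick the class maximising log prior + sum of log Gaussian likelihoods."""
    best, best_score = None, -math.inf
    for cls, (prior, means, varis) in model.items():
        score = math.log(prior)
        for v, m, var in zip(x, means, varis):
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best, best_score = cls, score
    return best

# Toy features: (mean packet length, stddev of packet length), invented values.
X = [(80, 10), (90, 12), (600, 200), (550, 180)]
y = ["ET", "ET", "Other", "Other"]
model = train_gnb(X, y)
assert classify_gnb(model, (85, 11)) == "ET"
assert classify_gnb(model, (580, 190)) == "Other"
```

This per-attribute Gaussian modelling is essentially what a Naive Bayes classifier computes for numeric attributes; the C4.5 Decision Tree instead recursively splits on the attribute thresholds that maximise information gain.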
4 Since feature calculation is done incrementally as a new packet arrives, increasing the sliding window size (bigger N) does not increase the memory requirement to buffer information of the most recent N packets (discussed in section 5.2). However, it does increase the time required for a classification decision to be made, which includes the waiting time for a packet to arrive and computational overhead.
5 NETMATE (Network Measurement and Accounting System) is free, open-source network measurement software. It was developed and maintained by Zander (Centre for Advanced Internet Architectures, Swinburne University of Technology) and Schmoll (Fraunhofer Institute for Open Communication Systems (FOKUS), Berlin, Germany). The tool is written in C++. It has a modular (class-based) structure, which means it can easily be extended, and dynamically loadable packet processing and information export modules.
6 WEKA is a free, open-source ML and data mining tool written in Java. A number of standard ML techniques have been incorporated into the software. WEKA has a wide range of users, including ML researchers, industrial scientists and teachers [175]. A number of works in IP traffic classification make use of the tool, for example [98], [171], [153], [151], [60] and [59].
5.3. MY EXPERIMENTAL APPROACH 95
5.3.3 Some statistical properties of ET traffic
Wolfenstein Enemy Territory is an online, team-based first person shooter (FPS) game built on
the Quake III Arena game engine 7. It has characteristics representative of the online FPS game
genre, which also make it ideal for demonstrating my proposal, in that:
• Its traffic statistical properties vary over different phases during the flow lifetime.
• Its traffic is asymmetric in client-to-server and server-to-client directions.
A demonstration of the results I gathered with ET traffic can serve as a first proof of concept,
and suggests the applicability of my approach to other similar types of traffic. (Elaboration on
this point is given in Appendix A, where I show the similarities in statistical properties between
ET and other applications.) This section introduces and analyses some critical statistical
properties of ET traffic, justifying the reasons for my novel training approach.
Consistent with many other online FPS games, ET traffic seen at a server can exhibit three
different phases: clients probing the server (Probing phase), clients connecting to the server
(Connecting phase), and clients playing a game on the server (In-game phase) [176].
Figure 5.2 shows the variation of an ET flow’s characteristics as a scatter plot of two features
- standard deviation versus mean of packet length - calculated with N = 25 across 1000 ET flow
samples. In this illustrative example, the Probing phase’s features are calculated on sub-flows
that cover the first N packets of the full-flows; the Connecting phase’s features are calculated
on sub-flows of size N starting from the 20th packet; and the In-game playing phase’s features
are calculated on sub-flows of size N starting from the 2000th packet. Full-flow features are
calculated over the entire Probing, Connecting and In-game periods.
Full-flow and In-game feature values are shown on the right, Probing and Connecting feature
values are shown on the left. With only two features the regions are partially overlapping and
partially disjoint. While there is considerable overlap between the different phases, there is
also some separation, which needs to be learnt by a classifier to identify the different sub-flow
phases.
Similar behaviours are seen with the server-to-client direction, as illustrated in Figure 5.3.
A similar mix of overlapping and disjoint regions also occurs with other features (such as inter-
7 Detailed game settings and properties can be found in [176].
Figure 5.2: Packet length from client to server for ET traffic, N = 25 packets (packet length standard deviation versus mean packet length, in bytes; left panel: C-S Probing and Connecting sub-flows; right panel: Full-flow and In-game)
Figure 5.3: Packet length from server to client for ET traffic, N = 25 packets (packet length standard deviation versus mean packet length, in bytes; left panel: Probing and Connecting sub-flows; right panel: Full-flow and In-game)
packet arrival time and inter-packet length variation). This suggests that a classifier trained on
full-flow feature values may have trouble recognising the clusters of feature values calculated
on small windows of packets.
The changes in ET's statistical properties over different phases become clearer when looking at
the distribution of feature values calculated at different points during a flow's lifetime. Figures
5.4 and 5.5 show the values of two features, the mean and standard deviation of packet length
from client to server, for ET traffic with a window size of 25 packets. The classifier's window
slides across 1,000 ET flows, where M is the number of packets offset from the beginning of
each flow.
Figure 5.4: Mean packet length from client to server for ET traffic, N = 25 packets (boxplots of feature values for M = 0 to 9000 and full-flow)
Figure 5.5: Standard deviation of packet length from client to server for ET traffic, N = 25 packets (boxplots of feature values for M = 0 to 9000 and full-flow)
The results are presented using boxplots 8. As shown in Figures 5.4 and 5.5, the Probing
phase (M = 0) and the In-game phase (M > 1000) have quite different ranges of values, while the
Connecting phase or early In-game phase (10 ≤ M ≤ 90) has a large range of values, which
seems to cover the value ranges of both the Probing and In-game phases.
Figures 5.6 and 5.7 show the mean packet length and standard deviation of packet length
features calculated over the first N packets of ET flows in the client-to-server direction. The
results are collected from 1,000 flow samples. The sample set contains flows longer than
1,000 packets, with a median flow size of 60,000 packets.
Figure 5.6: Mean packet length in the client-to-server direction, calculated for the window of the first N packets taken from 1,000 flow samples for ET traffic (1,000 values of the mean for each N value: 10, 25, 50, 500, 1000 and full-flow)
From the plots it appears that the statistics calculated for different N values are different
from each other and from those calculated for full-flows. The differences in these feature
values are most significant when calculated for a small window compared to full-flow. Features
calculated for larger windows yield distributions closer to those calculated on full-flows.
8 The black line in the box indicates the median; the bottom and top of the box indicate the 25th and 75th percentiles, respectively. The vertical lines drawn from the box are whiskers. The upper cap is drawn at the largest observation that is less than or equal to the 75th percentile + 1.5*IQR (the interquartile range, which is essentially the length of the box). The lower cap is drawn at the smallest observation that is greater than or equal to the 25th percentile - 1.5*IQR. Any observations beyond the caps are drawn as individual points; these points indicate outliers.
Figure 5.7: The standard deviation of packet length in the client-to-server direction, calculated for the window of the first N packets taken from 1,000 flow samples for ET traffic (1,000 values of the standard deviation for each N value)
However, I use the two-sample Kolmogorov-Smirnov (two-sample KS) test 9 [178] to show
that there is strong evidence that they are different.
Table 5.1 shows the p-values for pairs of feature sets calculated for two different N
values. Each feature set contains 1,000 mean packet length values, calculated over the first N
packets of 1,000 flow samples. With each pair of feature sets the p-value is the probability that both of the
observed results could arise by chance from the same parent source. A p-value of less than
0.05 is regarded as probably significant and is usually the threshold at which the null hypothesis
(that both results come from the same source) is rejected [178]. In all cases, other than when
comparing a sample set with itself, we observe very small p-values indicating strong evidence
that the distributions are different. Similar characteristics have been seen with other features.
Consequently, this confirms that ET’s statistics calculated for different N values are different
from each other and different from those calculated for full-flow.
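The decision rule of the two-sample KS test can be sketched as follows (a Python sketch using Smirnov's asymptotic critical value at significance level alpha, rather than the exact p-values reported in Table 5.1; function names are hypothetical):

```python
import math

def ks_2samp(a, b, alpha=0.05):
    """Two-sample KS test: D is the maximum distance between the two
    empirical CDFs; reject the null hypothesis (same distribution) when
    D exceeds Smirnov's asymptotic critical value at level alpha."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            i += 1
        else:
            j += 1
        # (ties are handled approximately; adequate for illustration)
        d = max(d, abs(i / len(a) - j / len(b)))
    c = math.sqrt(-0.5 * math.log(alpha / 2))   # ~1.36 for alpha = 0.05
    reject = d > c * math.sqrt((len(a) + len(b)) / (len(a) * len(b)))
    return d, reject

# Two clearly different samples: D = 1.0, null hypothesis rejected.
d, reject = ks_2samp(list(range(100)), [x + 200 for x in range(100)])
assert reject and abs(d - 1.0) < 1e-9
```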
Figures 5.2 and 5.3 also demonstrate another important aspect of ET traffic. They suggest
the asymmetry of ET flows’ characteristics for traffic in the client-to-server and server-to-client
directions, which motivates my work as presented in Chapter 7.
9 My sample datasets are not normally distributed for all N values, according to the results of applying the Anderson-Darling test [177] to the datasets. Hence the two-sample KS test is chosen as a general nonparametric method for comparing two samples. It quantifies a distance between the empirical distribution functions of the two samples. The null hypothesis is that the samples are drawn from the same distribution.
Table 5.1: Two-sample KS test p-values (probability of occurrence of the null hypothesis) for the mean packet length feature sets calculated for different N values, based on a set of 1,000 flow samples

N          10         25         50         500        1000       Full-flow
10         1          <2.2e-16   1.6e-04    <2.2e-16   <2.2e-16   <2.2e-16
25         <2.2e-16   1          <2.2e-16   <2.2e-16   <2.2e-16   <2.2e-16
50         <2.2e-16   <2.2e-16   1          <2.2e-16   <2.2e-16   <2.2e-16
500        <2.2e-16   <2.2e-16   <2.2e-16   1          <2.2e-16   <2.2e-16
1000       <2.2e-16   <2.2e-16   <2.2e-16   <2.2e-16   1          <2.2e-16
Full-flow  <2.2e-16   <2.2e-16   <2.2e-16   <2.2e-16   <2.2e-16   1
5.3.4 Constructing training and testing datasets
I demonstrate the effectiveness of my approach using different datasets for training and testing
the classifiers (as described in section 3.2) [109].
Flows in each dataset are divided into two classes - ET and Other (non-ET) - because super-
vised learning algorithms work best when trained with examples of traffic in the class of interest
and traffic known to be outside the class of interest (‘interfering’ or Other traffic).
A high-level description of the data traces used in the training and testing phases is shown
in Figure 5.8. Details of the training and testing steps of an ML classifier were presented in
Figure 3.5.
ET traffic
The ET datasets consist of two separate month-long traces collected during May and Septem-
ber 2005 at a public ET server in Australia [179]. The server was running ETPro (v3.2.0) [47].
Full-payload traffic was captured to disk with timestamps of microsecond resolution and ac-
curacy of better than +/-100usec. The distribution of domestic and international traffic on this
server was consistent with previously published work [180]. More information on geographical
5.3. MY EXPERIMENTAL APPROACH 101
Figure 5.8: High-level description of datasets used for training and testing (the ET and Other traffic training datasets feed ML training; the ET and Other traffic testing datasets feed ML classification, which produces the classification results)
distribution of game clients in terms of countries and hop counts is presented in Appendix C.
Raw ET traffic traces taken at an ET server typically contain far more short flows (clients
probing the server, usually less than 10 packets from client to server) than actual game-play
flows [180]. Balanced ET datasets for each month were created by taking all non-probe flows
(assumed to have more than 10 packets from client to server) and then sampling an equal number
of probe flows from the raw monthly traces. Table 5.2 summarises the resulting balanced ET
datasets.
Table 5.2: ET traffic full-flow datasets

Month  Non-Probe Flows  Probe Flows  Total Flows  Total Packets  Total Bytes
May    4344             4344         8688         107.9M         14.9G
Sep    3444             3444         6888         187.9M         26.6G
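The balancing step described above might be sketched as follows (an illustrative sketch with hypothetical names, not my actual processing scripts):

```python
import random

def build_balanced_dataset(flows, c2s_packet_count, seed=1):
    """Keep all non-probe flows (> 10 client-to-server packets) and sample
    an equal number of probe flows from the raw monthly trace."""
    non_probe = [f for f in flows if c2s_packet_count(f) > 10]
    probe = [f for f in flows if c2s_packet_count(f) <= 10]
    random.seed(seed)   # reproducible sampling for the illustration
    sampled_probe = random.sample(probe, min(len(probe), len(non_probe)))
    return non_probe + sampled_probe

# Toy trace: a flow is (flow_id, client-to-server packet count).
trace = [(i, 3) for i in range(50)] + [(i, 500) for i in range(50, 60)]
balanced = build_balanced_dataset(trace, lambda f: f[1])
assert len(balanced) == 20   # 10 non-probe flows + 10 sampled probe flows
```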
Other traffic
The interfering (non-ET) traffic is constructed from two 24-hour data traces collected by the
University of Twente, Netherlands, on February 6th and 7th 2004 [181]. I will refer to these
traffic sources as T1 and T2 respectively.
The interfering traffic datasets were built by extracting flows from T1 and T2 belonging to
a range of common applications. As payloads were missing, I inferred application type from the
port numbers (judged an acceptable approach because my primary criterion for interfering traffic
is that it was not ET). For each application's default port(s) I sampled a maximum of 10,000
flows per raw trace file 10. Table 5.3 summarises the overall mix of traffic in my resulting
interfering datasets.
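The port-based labelling could be sketched as follows; the port map below is a small hypothetical subset for illustration, not the full list used in my experiments:

```python
# Hypothetical subset of default server ports used to label interfering flows.
PORT_CLASSES = {
    80: "Web", 443: "Web",
    53: "DNS etc.", 123: "DNS etc.",
    25: "Mail etc.", 110: "Mail etc.", 143: "Mail etc.",
    22: "Mail etc.", 23: "Mail etc.",
    27015: "HalfLife",
}

def label_flow(server_port):
    """Infer application class from the server port; None means the flow
    matches no application of interest and is discarded."""
    return PORT_CLASSES.get(server_port)

assert label_flow(443) == "Web"
assert label_flow(9999) is None   # an unmapped port
```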
Table 5.3: Sampled interfering application flows - full-flow datasets

Applications                                Flows (x1000)  Bytes (MB)  Flows (x1000)  Bytes (MB)
                                            (T1)           (T1)        (T2)           (T2)
HTTP, HTTPS (Web)                           13.8           329.2       13.3           267.2
DNS, NTP (DNS etc.)                         2.4            1.4         2.7            1.4
SMTP, IMAP, POP3, Telnet, SSH (Mail etc.)   0.6            15.8        0.5            10.1
HalfLife                                    8.7            25.4        10.0           38.6
Kazaa, Bittorrent, Gnutella, eDonkey (P2P)  48.0           1,354.6     56.4           1,524.5
For each experiment described below, I trained my classifiers using a mix of ET traffic from
the May dataset and interfering traffic from T2. Subsequent testing of each classifier scenario
was performed using a mix of ET traffic from September and interfering traffic from T1.
Training with full-flow, testing with four different sliding windows
First I look at the effectiveness of classifying data using a sliding window across the test dataset
and an ML classifier trained on full-flow. I use windows of sizes N = 10, 25, 100, and 1000
packets. During ET game-play we see 20 packets per second (PPS) from server to client and
roughly 28 PPS from client to server, so these windows correspond to 0.2, 0.5, 2.1, and 20.8
seconds of actual time. Recall and Precision results are averaged across ET flows and interfering
flows in the test dataset. I show that a classifier trained on full-flow is ineffective in identifying
ET traffic even with N as large as 1,000 packets.
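The window-to-time conversion quoted above follows from the combined bi-directional packet rate; a quick arithmetic check, assuming 20 + 28 = 48 packets per second in total:

```python
PPS_SERVER_TO_CLIENT = 20
PPS_CLIENT_TO_SERVER = 28
total_pps = PPS_SERVER_TO_CLIENT + PPS_CLIENT_TO_SERVER   # 48 packets/s

for n in (10, 25, 100, 1000):
    print(f"N = {n:4d} -> {n / total_pps:.1f} s")
# N = 10 -> 0.2 s, N = 25 -> 0.5 s, N = 100 -> 2.1 s, N = 1000 -> 20.8 s
```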
Then I look at the effectiveness of different modified approaches to training the classifier in
the following experiments. In the scenario considered in section 5.2, my goal is to classify ET
traffic in less than one second, for timely classification. N = 25 (corresponding roughly to a time
window of 0.5 seconds) is therefore chosen as the sliding window size for testing.
10 Please note that P2P applications have a range of port numbers to which a server can default. All server ports in the range are used; hence the large number of flows collected.
Training with full-flow instances of more than 25 packets (called filtered full-flow), testing with a sliding window of N = 25 packets
The motivation for this modification is that the classifier only performs its classification on a
full sliding window, so only flows with more than 25 packets will be classified. Training the
classifier with flow instances shorter than 25 packets may only add noise to the classification
model, and may also incur a longer training time.
Training with individual sub-flow, testing with a sliding window of N = 25 packets
The motivation for this modification comes from my analysis of ET's statistical properties,
which suggests that a classifier trained on full-flows may have difficulty classifying on small
windows of packets, and that it should instead be trained on sub-flows of the same length as the
sliding window.
Training with multiple sub-flows, testing with a sliding window of N = 25 packets
The motivation for this modification is that a flow's statistical characteristics vary over the
different phases of its duration. By training on a combination of multiple sub-flows representing
different phases within the original full-flow, the classifier can recognise new flows if they
have statistical properties similar to any of the sub-flows on which it was trained.
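Assembling the multiple sub-flow training set described above might look like the following sketch (hypothetical names; flows shorter than M + N packets simply contribute no sub-flow at that offset):

```python
def build_multi_subflow_training_set(flows, offsets, n):
    """Pool sub-flows taken at several offsets M within each training flow,
    so the model sees examples from every phase of a flow's lifetime."""
    def extract(packets, m):
        return packets[m:m + n] if len(packets) >= m + n else None

    training = []
    for packets in flows:
        for m in offsets:
            sf = extract(packets, m)
            if sf is not None:
                training.append((m, sf))   # keep the offset for bookkeeping
    return training

flows = [list(range(3000)), list(range(60))]   # one long flow, one short flow
train = build_multi_subflow_training_set(flows, offsets=(0, 20, 40, 2000), n=25)
# The 60-packet flow contributes only SF-0 and SF-20 (SF-40 needs 65 packets),
# mirroring the reduction of training examples for later sub-flow positions.
```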
The detailed training and testing implementation is summarised in Tables 5.4 and 5.5.
Please note that from the original data trace, only flows that have at least N + M packets are
considered in each test.
Figure 5.9 presents the detailed combination of traffic (in flows and percentage) for training
the classification models. Close to 90% of flows in the full-flow model are shorter than 25
packets, hence the large reduction in flow instances to train the filtered full-flow model 11.
The ET class takes 9.5% and 25.0% out of the total training instances for the full-flow and
filtered full-flow models respectively. For the single sub-flow model, ET traffic takes from
10.5% to 14.1% of the total training instances. There are approximately 21.2% of ET instances
11 The filtered full-flow model has a slightly smaller number of training instances compared to the SF-0 model, since its flows are sampled from the training dataset for the full-flow model, which is a subset of the whole training dataset from which training instances for the SF-0 model are sampled.
Figure 5.9: Distribution of different applications' traffic (in flows and percentage) in the training datasets, shown as (a) flow counts and (b) flow percentages for the Full-Flow, Filtered Full-Flow, SF-0, SF-20, SF-40, SF-2000 and Multi-SFs models (application classes: Web, Mail etc., DNS etc., HalfLife, P2P and ET)
Table 5.4: Detailed training and testing implementation for each experiment

Experiment: Training with full-flow, testing with four different sliding windows
  Training: The classifier is trained using features calculated from full-flows in the training datasets, for both ET and Other traffic.
  Testing: The classifier is tested using features calculated for sliding windows of N = 10, 25, 100 and 1000. For ET traffic, M is chosen to cover two periods - early client contact with the game server (0 ≤ M ≤ 90) and active game-play (1000 ≤ M ≤ 9000). For Other traffic, M is chosen to be 19. (This value is arbitrary; what matters is that it is greater than 0, so that the classifier is also tested in the extreme case of missing the beginning of the test flows.)

Experiment: Training with filtered full-flow, testing with a sliding window of N = 25 packets
  Training: For both ET and Other traffic, the classifier is trained using features calculated from instances with more than 25 packets in both directions from the full-flow training datasets.
  Testing: The classifier is tested using features calculated from a sliding window of N = 25 packets. Similar to the above test, M is chosen to cover two periods: 0 ≤ M ≤ 90 and 1000 ≤ M ≤ 9000 for ET traffic, and M = 19 for Other traffic.
in the training datasets for the multiple sub-flows model.
The ET data trace used for training contains flows of unequal lengths, hence the reduction
of flow examples used for training when the sub-flow is chosen towards the end of the flows.
As a result, for the multiple sub-flows model, there are slight differences in the total num-
bers of training instances per sub-flow. There are more training instances for sub-flows at the
start, compared to sub-flows towards the end of a flow. SF-0 has the greatest number of ex-
amples (2,500 instances), compared to other sub-flows (consisting of between 1,000 and 1,500
instances). This may give more weight toward sub-flows selected at the early phase of flows
used for training. As a result, there is a possibility of inconsistent performance of the classifier
with regards to the position of the sliding window during the flow’s duration (e.g. better Re-
call when the sliding window is at the early stage of a flow as more instances are available for
training). In this experimental study, I chose to use the maximum number of sub-flow instances
available in the dataset, to maximise the information for training an accurate classifier. Evaluating the pros
and cons of applying re-sampling techniques (discussed in section 3.1.8) to create a balance
between different sub-flows is left for future research.
Table 5.5: Detailed training and testing implementation for each experiment (continued)

Experiment: Training with individual sub-flow, testing with a sliding window of N = 25 packets
  Training (ET traffic): The classifier is trained using features calculated from 25-packet sub-flows rather than full-flows. Four separate variants of the classifier are trained, using sub-flows that cover packets 0-24, 20-44, 40-64 and 2000-2024 respectively of the original ET flows in the training dataset. These sub-flows are selected to represent the statistical properties of ET traffic over different phases of the original ET flows. These classifier models are denoted SF-0, SF-20, SF-40 and SF-2000 respectively.
  Training (Other traffic): As most of the Other flows are short (95% of interference flows are shorter than 50 packets), only two phases are considered for training: the Beginning phase, which covers packets 1-25, and the Middle phase, which covers packets 10-34 of the original interference flows in the training dataset. For the SF-0 model, ET traffic is trained in combination with features calculated from the Beginning phase of the Other traffic. For the other sub-flow models, ET traffic is trained in combination with features calculated from the Middle phase of the Other traffic.
  Testing: Similar to the above tests, the testing instances are built using features from 25-packet sub-flows covering two periods - 0 ≤ M ≤ 90 and 1000 ≤ M ≤ 9000 for ET traffic. M is chosen to be 19 for Other traffic, so that the sliding window is at a different position from where the sub-flows were chosen for training. This is to avoid biasing the test results.

Experiment: Training with multiple sub-flows, testing with a sliding window of N = 25 packets
  Training (ET traffic): The classifier is trained using features calculated from different sub-flows.
  Training (Other traffic): The classifier is trained using features from 25-packet sub-flows that cover the Beginning and Middle phases of the original full interference flows in the training datasets.
  Testing: The same as the testing for the individual sub-flow models.
Figure 5.10: Distribution of different applications' traffic (in flows and percentage) in the testing datasets for N = 25, shown as (a) flow counts and (b) flow percentages versus M, the number of packets offset from the beginning of each flow
Figure 5.10 presents the detailed combination of traffic (in flows and percentages) for testing
the different classification models with N = 25 packets. Similar to the training datasets, the total
number of ET flows reduces as N and M increase. With different N values (shown in Appendix
C.2), the ratios of the classes' instances differ: ET is the minority class with N = 10 packets,
and the majority class with N = 1,000 packets.
In addition, the proportions of the applications' traffic in the training and testing datasets are
different. For example, with N = 25 packets, the ET class takes approximately 11.9-17.1% of
the total testing instances across all M values 12. This is different from the traffic profile of all
the training models, which suggests that the classifier may not have to be trained on the same
traffic profile as the one it expects to see in deployment.
5.3.5 Data processing
To build and test the classification model, a large range of sample traffic was collected. ET
traffic used for training and testing was extracted from two month-long traces with a total size
of 41.5 GBytes of ET data. Other traffic was extracted from two day-long traces with a total of
3.6 GBytes.
Each data trace needs to be processed for full-flow and sub-flow feature values. A total of
six classification models need to be trained and tested with features calculated for 19 different
positions of the sliding window for ET traffic and two different positions for Other traffic. The
data processing is therefore time-consuming if done sequentially on a single processing unit.
In my experiments, the data processing was undertaken in parallel using the supercomputer
cluster provided by the Centre for Astrophysics and Supercomputing, Swinburne University of
Technology [182] 13.
12 My Precision results can, by definition, be affected by the traffic mix (due to the number of false positives coming from the Other class), and may be lower than if a balanced mix of 50% Other and 50% ET were used instead. However, I wanted more data for testing, and my assumption is that the traffic of a single application can be much smaller than the total traffic aggregate. I accept that my Precision may be negatively affected by the traffic mix chosen.
13 Each node of the supercomputer cluster has a quad-core Clovertown 64-bit low-volt Intel Xeon 5138 processor running Linux CentOS 5 at 2.33 GHz. Virtual memory is set at 1 GByte. Jobs are submitted to the cluster via a batch queue system. I used WEKA implementation version 3.4 with Java version 1.4.2.
5.4. RESULTS AND ANALYSIS 109
5.4 Results and analysis
In this section I present the results with respect to M, the number of packets offset from the
beginning of each ET flow in the test dataset.
5.4.1 Training with full-flows, testing with four different sliding windows
Figure 5.11 shows Recall for the Naive Bayes classifier as each sliding window moves across
the test dataset. For all N values, Recall degrades rapidly as we move further from the start of
each flow. Recall for N = 1000 is good (85%) when the flow is captured from the beginning,
but rapidly drops below 10% if the classifier misses more than the first 30 packets. Recall for
small sliding windows is poor (≤ 66%) even when the beginning of a flow is captured. Missing
more than the first 20 packets further degrades Recall to lower than 20% for all N values.
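The Recall and Precision figures reported throughout this section follow the standard per-class definitions, which can be computed from confusion counts as in the following generic sketch (treating ET as the positive class):

```python
def recall_precision(predictions, truths, positive="ET"):
    """Recall = TP / (TP + FN); Precision = TP / (TP + FP), as percentages."""
    tp = sum(1 for p, t in zip(predictions, truths) if p == positive and t == positive)
    fn = sum(1 for p, t in zip(predictions, truths) if p != positive and t == positive)
    fp = sum(1 for p, t in zip(predictions, truths) if p == positive and t != positive)
    recall = 100.0 * tp / (tp + fn) if tp + fn else 0.0
    precision = 100.0 * tp / (tp + fp) if tp + fp else 0.0
    return recall, precision

# Toy example: 2 true positives, 1 false negative, 1 false positive.
preds  = ["ET", "ET", "Other", "ET", "Other"]
truths = ["ET", "Other", "ET", "ET", "Other"]
assert recall_precision(preds, truths) == (100.0 * 2 / 3, 100.0 * 2 / 3)
```

Note that when no instance is predicted as ET, Precision is undefined; the sketch (and the plots below, where 0% Recall yields 0% Precision) reports it as zero.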
Figure 5.11: ET Recall: Classifier trained with full-flows, tested with four different sliding windows (N = 10, 25, 100, 1000) - Naive Bayes models
Figure 5.12 depicts Precision for the Naive Bayes classifier as each sliding window moves
across the test dataset. Precision is 100% for all window sizes and M values where Recall > 0.
Classifying in the middle of game-play (M > 1000) yields Recall close to 0%, making the
high Precision achieved somewhat meaningless. At the few points where M = 6K, 8K and 9K,
Precision is 0% (caused by 0% Recall).
Figure 5.12: ET Precision: Classifier trained with full-flows, tested with four different sliding windows (N = 10, 25, 100, 1000) - Naive Bayes models
Figure 5.13 summarises Recall as each sliding window moves across the test dataset for
the C4.5 Decision Tree classifiers. Recall is quite good at a flow's beginning, even for small
sliding windows (80%, 92% and 92% for N = 10, 25 and 100 respectively), and reaches 98%
for N = 1000 when the beginning of a flow is captured. However, as with the Naive Bayes
classifiers, there is significant degradation in Recall for all window sizes if we miss the start of
the flows (dropping to approximately 50% and 25% when missing the first 20 packets).
Figure 5.14 presents Precision for the C4.5 Decision Tree classifiers as each sliding window
moves across the test dataset. For window sizes of 10, 25 and 100, Precision drops significantly
for M ≥ 10. With N = 1000, it reduces gradually from 99% to 80%; however, with Recall of
less than 20% for M ≥ 20 packets, the high Precision becomes less meaningful.
To sum up, these results demonstrate the poor performance of the full-flow models when
classifying on small sliding windows and when missing the start of a flow.
5.4.2 Training with filtered full-flows, testing with a sliding window of N = 25 packets
Figure 5.15 shows Recall when the Naive Bayes classifier is trained using features from filtered
full-flow instances.
This model has a better Recall compared to the full-flow model when classifying with N
Figure 5.13: ET Recall: Classifier trained with full-flows, tested with four different sliding windows (N = 10, 25, 100, 1000) - C4.5 Decision Tree models
Figure 5.14: ET Precision: Classifier trained with full-flows, tested with four different sliding windows (N = 10, 25, 100, 1000) - C4.5 Decision Tree models
Figure 5.15: ET Recall and Precision: Classifier trained on filtered full-flows, N = 25 for classification - Naive Bayes models
= 25 packets. However, Recall still drops off quickly to less than 50% if we miss more than
the first 10 packets of a flow. Precision is greater than 50% if the beginning of each flow is
captured, otherwise it becomes rather poor. Precision is lower compared to the full-flow model
when Recall is greater than 0.
The increase in Recall and reduction in Precision when comparing the filtered full-flow to
the full-flow model can be explained as follows.
As shown in section 5.3.4, the training dataset for the full-flow model contains many noisy
instances, due to short flows of ET and Other traffic (≥ 50% of ET flows and 90% of Other flows
are shorter than 25 packets). The ML algorithm needs to rule out this noise in order to identify ET
traffic. This may lead to a classification model that is over-fitted and can only identify a small
number of ET instances. In this case, the low Recall and perfect Precision (100% where Recall
> 0) of the full-flow model are a strong indicator of over-fitting.
On the other hand, this noise has already been removed from the filtered full-flow model.
The ML algorithm can now create a model that covers a larger range of ET instances. This helps
improve Recall yet creates the opportunity for false positives, which leads to lower Precision.
To demonstrate this explanation, consider the simplified example illustrated in Figure 5.16(a)
below. Due to noise in the Other traffic, the classification model created only covers a small
range of ET instances. In Figure 5.16(b), with the noise removed (its points are faded in
the figure), the filtered full-flow model covers a greater range of ET instances.
Depending on the internal construction of a particular ML algorithm, the impact of noise
traffic on Recall and Precision may differ. With this particular use of a Naive Bayes classifier,
the removal of noise improves Recall yet decreases Precision as a trade-off.
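The filtering step discussed above can be sketched as follows. This is a hedged, minimal illustration: the flow-record structure and the helper name are hypothetical, and only the 25-packet threshold comes from this section.

```python
# Sketch: build a "filtered full-flow" training set by discarding short flows,
# which section 5.3.4 treats as noise. The flow records and the helper name
# are hypothetical; only the 25-packet threshold comes from the text.

def filter_training_flows(flows, min_packets=25):
    """Keep only flows long enough to fill an N = 25 packet sliding window."""
    return [f for f in flows if f["num_packets"] >= min_packets]

flows = [
    {"label": "ET", "num_packets": 12},      # short ET flow  -> noise, dropped
    {"label": "ET", "num_packets": 4000},    # long ET flow   -> kept
    {"label": "Other", "num_packets": 8},    # short Other    -> noise, dropped
    {"label": "Other", "num_packets": 300},  # long Other     -> kept
]

filtered = filter_training_flows(flows)
print([f["num_packets"] for f in filtered])  # [4000, 300]
```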
Figure 5.17 summarises Recall when the C4.5 Decision Tree classifier is trained using
features from filtered full-flows.
This model has a better Recall compared to the full-flow model when classifying with N =
25 packets. Precision is also better compared to the full-flow model when Recall is greater than
0, which suggests the benefit of filtering out noise in training the C4.5 Decision Tree classifier.
However, poor Precision and Recall are seen when the classifier misses more than the first 20
packets of a flow. Both Recall and Precision appear better with M ≥ 1000, yet they are still
quite low (less than 70%).14
In summary, both the Naive Bayes and C4.5 Decision Tree classifiers trained on filtered full-
flows perform poorly on a small sliding window and when missing the beginning of a traffic
flow.
I will go on to compare the results of the filtered full-flow model with the multiple sub-flows
and the best single sub-flow models because:
• They are all trained on flow instances with more than 25 packets.
• Between the filtered full-flow and full-flow models, the former produces better results
while the latter is considered to be over-fitted to the training set.
5.4.3 Training with individual sub-flows, testing with a sliding window of N = 25 packets
Figure 5.18 presents Recall when the Naive Bayes classifier is trained using features calculated
from 25-packet sub-flows rather than full-flows. Each model (as defined in Table 5.5) shows
interesting variation in Recall with respect to the position of the sliding window across the test
dataset. Training on a sub-flow at a particular phase of a flow tends to demonstrate higher Recall
at the same phase in the test dataset, and lower Recall otherwise.
14 It is noted that the classifier trained on filtered full-flows performs worst in the transition stage when 20 ≤ M ≤ 90. However, I will not investigate this particular issue further, as I later show that training the classifier on multiple sub-flows overcomes this issue and provides more stable Recall and Precision throughout the phases of a full-flow.
[Figure 5.16: An illustration of creating classification rules for the full-flow and filtered full-flow models. (a) Full-flow model: region of ET traffic covered by classifiers trained on full-flows, amid long and short ET and Other flows. (b) Filtered full-flow model: region of ET traffic covered by classifiers trained on filtered full-flows (where all short flows are removed).]
[Figure 5.17: ET Recall and Precision: Classifier trained on filtered full-flows, N = 25 for classification - C4.5 Decision Tree models]
[Figure 5.18: ET Recall: Classifier trained on 25-packet sub-flows, N = 25 for classification - Naive Bayes models]
Recall starts very high at the beginning of a flow and then drops off quickly if we miss more
than the first 10 packets for the SF-0 model. On the other hand, it stays low until the sliding window
has moved beyond the early period of each flow (M ≥ 90) for the SF-2000 model. Recall for the
SF-20 model is quite good even if we miss 30 or 40 packets, but eventually becomes quite
poor. The SF-40 model exhibits good overall Recall of greater than 80% for all M values.15
These results are expected as each sub-flow model presents a particular phase with distinctive
statistical properties of an ET flow during its lifetime.
[Figure 5.19: ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classification - Naive Bayes models]
Figure 5.19 shows Precision for all sub-flow models. Precision drops off quickly if we miss
more than the first 10 packets of a flow for the SF-0 model. On the other hand, it stays low until the
sliding window has moved beyond the early period of each flow (M ≥ 60) for the SF-2000 model.
When trained on SF-20 the classifier's Precision is quite good even if we miss 90 packets, but
eventually becomes quite poor. The SF-40 model exhibits good overall Precision, from 97.8% to
98.7%, for all M values. These results follow the same trend as Recall. This is to be expected,
because if the classifier recognises instances of both classes better (better Recall for both ET
and Other traffic), it will have better overall Precision.
15 This suggests that perhaps the game's transition between 'Connecting' and 'In-game' phases occurs during or near SF-40, so this cluster contains instances of both 'Connecting' and 'In-game' statistics. For excellent Recall we cannot use just SF-40, as will be shown later in section 5.4.4.
Compared to the results of the full-flow and filtered full-flow models, training on a sub-flow
picked from within each original training flow (e.g. the SF-40 model) significantly improves
the classification performance, especially when M is greater than 0 (i.e. real-world scenarios
where the classifier cannot be sure it sees the start of every flow).
Figure 5.20 presents Recall when the C4.5 Decision Tree classifier is trained using features
from 25-packet sub-flows rather than full-flows.
[Figure 5.20: ET Recall: Classifier trained on 25-packet sub-flows, N = 25 for classification - C4.5 Decision Tree models]
Similar to the results seen with the Naive Bayes classifier, Recall for the C4.5 Decision Tree
classifier is very high at 99.2% at the start of a flow, then drops off quickly if we miss more than
the first 20 packets for the SF-0 model. On the other hand, Recall remains low until the sliding
window has moved beyond the early period of each flow (M ≥ 90) for the SF-2000 model.
When trained on SF-40 the classifier’s Recall is quite good even if we miss more than 1,000
packets. When trained on SF-20, Recall degrades when the sliding window has moved beyond
the first 30 packets.
Figure 5.21 shows Precision for all sub-flow models (Figure 5.22 is a zoomed-in version for
clearer presentation). Similar to the Naive Bayes classifier, the SF-0, SF-20 and SF-2000 models
have good Precision only when the sliding window is either at the beginning of each flow or has
moved beyond its early period. The SF-40 model maintains good Precision, from 97.6% to 98.6%,
for all M values.
[Figure 5.21: ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classification - C4.5 Decision Tree models]
Again, compared to the filtered full-flow models, training on a sub-flow picked from within
each original training flow (e.g. the SF-40 model) significantly improves the classification per-
formance for all M values.
5.4.4 Training with multiple sub-flows, testing with a sliding window of N = 25 packets
In this section I demonstrate how a far more effective classifier can be constructed using multiple
sub-flows, which represent different time periods within the original full-flows.
Trying different combinations of the four sub-flows SF-0, SF-20, SF-40 and SF-2000, the
combination of all four produced better overall Precision and Recall than any combination of a
subset of those sub-flows.
Figure 5.23 shows this new classifier's Recall, along with Recall for a classifier trained on
SF-40 (the best-performing single sub-flow model) and a classifier trained on filtered full-flows,
using a sliding window of 25 packets.
The multiple sub-flows curve shows very good Recall early in a flow’s life (M≤ 30) (95.7%-
98.8% compared to 83.2%-93.4% for the best single sub-flow model). For 40≤M ≤ 70 Recall
is comparable to training on the single sub-flow (91.3%-93.4% versus 89.4%-93.7% respectively).
For M ≥ 80, training the classifier on multiple sub-flows outperforms training on the single
sub-flow by 5%-14%. Training the classifier on filtered full-flows results in substantially
degraded Recall compared to training on a single or multiple sub-flows.

[Figure 5.22: ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classification - C4.5 Decision Tree models - a zoomed-in version of Figure 5.21]

[Figure 5.23: ET Recall: Comparing full-flow and sub-flow training of the Naive Bayes classifier]
[Figure 5.24: ET Precision: Comparing full-flow and sub-flow training of the Naive Bayes classifier]
Figure 5.24 shows Precision of the models. Both the best single sub-flow and multiple sub-
flows models produce greatly improved Precision compared to the filtered full-flows model.
Given the analysis of traffic statistical properties for ET traffic in the section above (and for
Other traffic in Appendix A), this is expected: training the classifier on features calculated from
sub-flows of the same length as the sliding window should identify both ET and Other traffic
better than training on full-flows.
Comparing the Precision of training on a single sub-flow with that of training on multiple
sub-flows (with features calculated the same way) suggests a trade-off between Recall and
Precision. Precision when trained on the multiple sub-flows (from 89.8% to 94% for all M values)
is 4.7% to 8% lower than the Precision for the best single sub-flow model.
The improvement in Recall and reduction in Precision when comparing the multiple sub-
flows model with the best single sub-flow model can be explained as follows.
Each sub-flow’s feature values form a cluster. Multiple sub-flows’ clusters may either be
overlapping or disjoint. Using a single sub-flow to train the classifier can leave out members of
other sub-flows, which are outliers to its cluster. Including those outliers in training the classifier
improves Recall, yet on the other hand creates opportunities for false positives, which leads to
lower Precision.
To demonstrate this concept, consider the example displayed in Figure 5.25.
[Figure 5.25: An illustration of creating a multiple sub-flows classifier from a number of individual sub-flows (SF-0, SF-40, SF-2000 and other sub-flows; data points are artificially created for illustrative purposes only).]
As indicated in Figure 5.25, each single sub-flow forms a cluster: SF-0 forms a cluster of
pink squares, SF-40 forms a cluster of blue circles, SF-2000 forms a cluster of orange triangles,
and other points indicate members of other sub-flows. These clusters are partially overlapping
and partially disjoint. Training on a single sub-flow (e.g. SF-40 for the best single sub-flow
model) leaves out many members of other sub-flows (outliers to the SF-40 cluster). Training
on multiple sub-flows makes sure these members are included in constructing the classification
model. The classifier’s Recall is therefore improved when the sliding window moves across
the test dataset. (This helps explain the results in Figure 5.23, where Recall for the multiple
sub-flows classifier is better during the early and later phases of the ET flows, compared to the
best single sub-flow classifier. This is due to the additional inclusion of SF-0 and SF-2000’s
members in training the multiple sub-flows model.)
On the other hand, the inclusion of these members creates a greater unwanted area, which
is the gap between the contributing clusters (indicated by the grey area in the figure), when
compared to training on a single sub-flow. The greater the unwanted area, the greater the
opportunities for false positives – hence the lower Precision.
This illustration is a simple explanation of the trade-offs between Precision and Recall with
regards to the selection of sub-flows to train the classifier. This also suggests a novel approach
for an automated sub-flows selection, which will be presented in the next chapter.
It is also notable that Precision for the multiple sub-flows model decreases slightly as M
increases, especially for 1K ≤ M ≤ 9K, even though Recall remains almost the same for these M
values and the same instances of Other traffic are used for testing at all M values. This can be
explained as follows.
Precision for ET traffic is calculated as TP / (TP + FP) (defined in section 2.3). FP is a constant
(as the same instances of Other traffic are used for testing at all M values). Precision, therefore,
depends only on the TP of ET traffic for each test dataset.

Since d(Precision)/d(TP) = FP / (TP + FP)^2, which is always positive, Precision increases
monotonically with TP.

When M increases there are fewer flows longer than M + N packets. Consequently, there are
fewer ET flows for testing (as shown in Figure 5.10) and the TP for ET traffic is reduced. This
explains why Precision falls as M increases.
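The argument above can be checked numerically. The sketch below uses invented TP and FP counts purely for illustration; it confirms that with FP held constant, Precision falls as TP falls.

```python
# Sketch: Precision = TP / (TP + FP). With FP constant (the same Other
# instances are tested at every M), Precision moves with TP alone.
# The TP and FP counts below are invented for illustration.

def precision(tp, fp):
    return tp / (tp + fp)

FP = 50                       # constant false positives across all M values
tps = [1000, 800, 600, 400]   # TP shrinks as M grows and fewer ET flows remain
precisions = [precision(tp, FP) for tp in tps]

# Precision strictly decreases as TP decreases, i.e. d(Precision)/d(TP) > 0.
assert all(a > b for a, b in zip(precisions, precisions[1:]))
```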
Similar results have been seen with the C4.5 Decision Tree model. Figure 5.26 summarises
Recall for different classification models. For all values of M, the best single sub-flow model
and the multiple sub-flows model outperform the filtered full-flow model.
Total false positives for the multiple sub-flows models are 0.41% and 1.59% for the C4.5
Decision Tree and Naive Bayes models, respectively. Of these, most false positives come from
P2P traffic (25% and 71.0% respectively) and Half-Life traffic (45.4% and 10.7% respectively).
Figure 5.27 shows Precision for the different C4.5 Decision Tree classifiers. Precision holds
steady at 97%-98% when trained on the multiple sub-flows. It is much better than that of the
filtered full-flow model, and comparable to the best single sub-flow model. Similar to the Naive
Bayes classifier, Precision for the multiple sub-flows model slightly decreases as M moves
further from the beginning of the flows.
To sum up, my results demonstrate that for applications with time-varying traffic character-
istics there are significant benefits to training ML classifiers using features calculated from one
(or more) sub-flows rather than full-flows.
[Figure 5.26: ET Recall: Comparing full-flow and sub-flow training of the classifier - C4.5 Decision Tree models]
[Figure 5.27: ET Precision: Comparing full-flow and sub-flow training of the C4.5 Decision Tree classifier]
5.5 Discussion
There are several notable points about my approach:
• Sub-flows taken near the start of a flow usually yield more training instances than sub-flows
taken towards the end of the training flows, due to the variation in flow length. This gives
more weight towards the start of the flows, and may create intra-class imbalance effects
(discussed in section 3.1.8).
In my experimental approach, I selected sub-flows at different phases during the flow’s
lifetime, and obtained the maximum number of instances available in each sub-flow to
train a classifier. It was demonstrated that this performed well with my datasets. An
evaluation of the classifier’s performance with intra-class balancing (such as [145]) is left
for future research.
• For different applications, the number of sub-flows required may be different. This de-
cision is dependent on the application’s statistical characteristics, especially the typical
flow lengths and variations in traffic characteristics over the flow’s lifetime.
Selecting a much greater number of sub-flows for the application of interest than for the
other applications (or vice versa) may lead to the inter-class imbalance problem (discussed
in section 3.1.8).
For the illustrative scenario studied in this chapter, I consider it more important to
correctly classify the ET class. The sensitivity of Recall and Precision when using more
sub-flows in training for ET traffic is investigated in Chapter 6.
• My analysis in this chapter depended on manual inspection of ET’s particular traffic char-
acteristics. Training a classifier for optimal recognition of another application may re-
quire an entirely different choice of sub-flows. Ideally we would like to avoid having to
manually inspect and identify the optimal set of sub-flows for each application of interest.
In the next chapter I propose utilising unsupervised clustering ML algorithms to
automatically identify key sub-flows within examples of an application’s full-flows. Intuitively
this seems reasonable, since unsupervised learning algorithms identify a ‘natural’
clustering of sub-flows, from which we may identify a set of sub-flows that represent key
statistical characteristics of the full-flow. (The existence of natural clustering of feature
values in ET was hinted at in Figures 5.2 and 5.3.)
• While making use of bi-directional flows for both the training and classifying phases, I define
the forward direction as the client-to-server direction. In practice, the classifier cannot
assume anything about the direction (whether client to server or vice-versa) of the first
packet in the N-packet sliding window at any instant (particularly if the classifier misses
some packets from the start of any given flow). The challenge of building a direction-
neutral classifier is addressed in Chapter 7.
• Depending on the particular application we are trying to classify and the particular ML
algorithm, there will be a trade-off between keeping N low (for timely classification and
reduced memory consumption) and keeping N high (for acceptable Recall and Precision).
A very short window may not be good enough to differentiate between different appli-
cations. A large window may improve the classifier’s Precision and Recall, yet increase
the time required to collect statistics from enough packets before a classification decision is
made. For example, I performed similar comparisons using N = 10. For the Naive Bayes
classifier, the median Recall and Precision were 24.3% and 5% lower than for N = 25,
respectively. For the C4.5 Decision Tree classifier, the median Recall and Precision were
0.5% and 5% lower than for N = 25, respectively. Detailed analysis of this trade-off is a
subject for future research.
• Since Recall and Precision rarely reach 100%, in continuous classification there is a possi-
bility of flapping (oscillation) in classification results when monitoring traffic flows over
their lifetime. This can be overcome by applying a scheme to verify the classification
result before taking further action. For example, the classifier would only send an update
of the flow classification result if it sees two new, consecutive and identical results.16
This technique was applied and demonstrated to work well in [76].
16 The classifier then has hysteresis included, so that the result is sustained for a longer duration, with noise suppression during the steady state (e.g. [183]).
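The two-consecutive-results scheme above can be sketched as follows. This is a hedged illustration of the idea only: the generator below is my own construction, not the implementation used in [76].

```python
# Sketch: suppress classification flapping by reporting a class only once it
# has appeared in two consecutive sliding-window results (simple hysteresis).

def stable_updates(results, confirmations=2):
    """Yield a new class label only after it repeats `confirmations` times."""
    current, candidate, streak = None, None, 0
    for label in results:
        streak = streak + 1 if label == candidate else 1
        candidate = label
        if streak >= confirmations and label != current:
            current = label
            yield current

# A one-off "Other" result between "ET" results is suppressed.
updates = list(stable_updates(["ET", "ET", "Other", "ET", "ET", "Other", "Other"]))
print(updates)  # ['ET', 'Other']
```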
There are also a number of limitations in my current experimental approach. Further im-
provement can be made in the following areas:
• My training dataset used a mixture of traffic collected at different locations. This approach is
practical when examples of traffic collected at a single monitoring point are not sufficient
for learning. It does not affect my results, as it preserves the inter-flow characteristics of
the sample traffic used for training. However, the portability of the trained classifier should
be evaluated.17
• Only a limited number of common interfering applications were used for training the classifier
model. Extending the training dataset for the Other class is a subject for future research.
• There is another FPS game (Half-Life) in my dataset for the Other traffic, which ac-
counted for less than 10% of the traffic mix for training and less than 0.3% of the traffic mix for
testing (as shown in section 5.3.4). With such a small amount of traffic, the negative im-
pact of this FPS game on my Precision results is insignificant (as presented in sections 5.4.3
and 5.4.4). However, the proportion of Half-Life traffic among the false positives in my
results suggests that including other FPS games’ traffic (which has similar character-
istics to ET [176]) in the Other class could degrade the Precision of ET traffic classification.
The classifiers should be trained with more examples of these other FPS games’ traffic and
re-evaluated in that case. The separation of ET from other FPS games’ traffic is a subject
for future work.
• Sub-flows used to train the Other class have not been optimised. One reason for this is
that most of the Other class’s example flows are short. Further investigation into which
sub-flows are best for training the classifier on the Other class may lead to better results.
• The test dataset is constructed with a static selection of sliding window positions. Ex-
haustive testing of the classifier models throughout the flow’s lifetime would be ideal.
17 Results from [76] demonstrate that the classification model constructed in this thesis performs well on live capture of ET traffic in a lab environment.
5.6 Conclusion
In this chapter I have proposed a novel solution: the ML classifier should be trained using
statistical features calculated from multiple short sub-flows extracted from full-flows generated
by the target application. The sub-flows are selected from regions of the application’s full-flows
that have noticeably different statistical characteristics.
I show that this can significantly improve a classifier’s performance when using a small
sliding window, regardless of how many packets are missed from each flow’s beginning. My
proposal is illustrated by constructing, training and testing Naive Bayes and C4.5 Decision
Tree classifiers for the detection of Wolfenstein Enemy Territory online game traffic. With this
particular scenario good results were found when trained on four sub-flows and using a sliding
window of only 25 packets.
Focusing on the identification of ET traffic, I have thousands of full-flow samples for most
interfering applications, but only a few hundred for some (as detailed in section 5.3.4). For a
better classification model (in terms of Precision) we need to train the classifier in the presence
of a larger and more diverse collection of interfering traffic. This endeavour is left for future
work.
In this chapter, representative sub-flows used for training were manually selected. Training
a classifier for a new application may require an entirely different set of sub-flows. This step
should ideally be done automatically, without requiring expert knowledge of the application of
interest. Furthermore, for both training and classifying phases, I ensure that the forward direc-
tion is defined as the client-to-server direction. In the next chapter, I propose novel approaches
that supplement my training method to overcome the issue of automated sub-flows selection
and the problem of directionality.
Chapter 6
Automated Sub-Flow Selection using Unsupervised Clustering Techniques
6.1 Introduction
In Chapter 5 I presented a novel approach to ML-based IPTC with two unique characteristics:
the classifier is trained on multiple short sub-flows (each sub-flow being a fragment of N consec-
utive packets taken from different points within the original application flow’s lifetime); and the
classification decision process is repeated continuously on a sliding window of the most recent
N packets seen by the classifier. This allows my classifier to accurately identify applications
whose traffic statistics change over time.
A crucial step is the a priori identification of appropriate sub-flows to train the classifier.
These sub-flows must cover all possible phases of the full-flow during its lifetime for consistent
and stable classification. For applications with well understood traffic characteristics, this can
be done based on the domain knowledge of an expert. For example, the initial handshake of
an SMTP connection may look quite different to the traffic while transferring the body of each
email; hence sub-flows should be taken at the beginning and middle of the flow. However,
training a classifier for a new application may require an entirely different set of sub-flows.
Ideally the identification of sub-flows would be done automatically, without the need for
expert knowledge about the application of interest. It is also ideal to eliminate the need to
manually handle the complexity of data analysis in studying the application’s traffic. In this
chapter I propose and demonstrate an automated approach that uses clustering ML techniques
to select sub-flows for training.
My approach first identifies sub-flows that are subsets of each full-flow’s packets passing
the classifier, from the beginning to the end of the full-flow. However, training the
classifier using all sub-flows found in this step may incur substantial processing overhead. The
next step is to select only a limited number of representative sub-flows to train the classifier:
those that best capture the distinctive statistical variation of a full-flow during its
lifetime. This step is important to minimise the load on the classifier, both during training and
classification, while still maintaining accurate classification. It is this
step that I propose to automate through the use of clustering ML techniques.
To demonstrate my proposal, I use the same hypothetical scenario as in Chapter 5 where ET
application traffic needs to be identified. I use the Expectation Maximisation (EM) algorithm
[161] for automated selection of sub-flows, and the Naive Bayes and C4.5 Decision Tree su-
pervised learning ML algorithms for subsequent traffic classification. The classifiers built using
the proposed approach are evaluated using accuracy and computational performance metrics.
This chapter is organised as follows. Section 6.2 introduces my proposal. Section 6.3
describes my experimental approach. The results and analysis are presented in section 6.4.
Finally, the chapter is concluded in section 6.7 with some final remarks and suggestions for
future work.
6.2 My proposal
The sub-flows identification and selection to train a classifier can be described in two steps as
follows:
1. Sub-flow identification: Extract two or more sub-flows from every flow that represents
the class of traffic one wishes to identify in the future.
2. Sub-flow selection: Examine the extracted sub-flows to select a number of representa-
tive sub-flows that best capture distinctive statistical characteristics of the application of
interest (e.g. at the start and middle of the flow).
The purpose of Step 1 is to find all possible sub-flows to train the classifier. The only crucial
requirement is that the step must cover all possible phases of the application’s flows during their
lifetime.
Step 2, which involves the selection of representative sub-flows among all sub-flows found
in Step 1, can be a challenging task in practice. Each sub-flow’s instance is represented by the
values of multiple features, which results in multi-dimensional datasets to be examined.
The following sub-sections elaborate on how we can automate these two steps.
6.2.1 Step 1 - Sub-flow identification
I propose the following approach to automate Step 1 of sub-flow identification:
• Choose a window size N and a step size S.

• Starting at the first packet, slide across the training dataset in steps of S packets (for
example, S = N/2), creating sub-flows of N consecutive packets at each step.
The same sub-flow positions1 are used for all full-flows in the data trace. This is proposed
because sub-flow instances selected at the same position with respect to the full-flow
should share similar statistical properties, and each position gives us a collection of instances to
study, representing a specific phase of the full-flow’s lifetime.
The positions of sub-flows are selected based on the chosen values of N and S. With a
suitable value of S, we can cover all full-flows’ phases. This coverage is better than randomly
selecting the position of sub-flows. One potential drawback is that it may lead to a greater
number of sub-flows identified and thus higher computational processing cost (where S and N
are small). This approach can also be applied for flows that exhibit periodic characteristics.
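The windowing loop described above can be sketched as follows. This is a hedged illustration: packet records are placeholders for per-packet features, and N = 24 is used here simply so that S = N/2 is a whole number (the thesis itself uses N = 25).

```python
# Sketch of Step 1 (sub-flow identification): slide a window of N packets in
# steps of S packets across one full-flow, recording each window's offset.

def identify_subflows(flow_packets, n=24, s=12):
    """Return {offset: list of n consecutive packets} for one full-flow."""
    return {start: flow_packets[start:start + n]
            for start in range(0, len(flow_packets) - n + 1, s)}

full_flow = list(range(100))             # a toy 100-packet flow
subflows = identify_subflows(full_flow)  # S = N/2 -> offsets 0, 12, 24, ...
print(sorted(subflows))  # [0, 12, 24, 36, 48, 60, 72]
```

With S = N/2 adjacent windows overlap by half, as the text notes may happen depending on the choice of S.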
This approach can be illustrated as follows. A data trace with multiple full-flow instances
will result in a set of sub-flows with the same offset position from the beginning of each full-flow. SF-M denotes the set of sub-flows at position M, taken from the Q full-flow instances in the data trace: SF-M = {SF-M0, SF-M1, SF-M2, ..., SF-MQ−1}. SF-M is called a sub-flow class (or sub-flow for short). SF-M0, SF-M1, SF-M2, ..., SF-MQ−1 are members of sub-flow SF-M, or
sub-flow SF-M’s instances. Figure 6.1 illustrates how sub-flow instances are identified within
a single full-flow. S = N/2 is chosen in this example. K sub-flows are identified for a full-flow instance Fi (0 ≤ i ≤ Q−1), namely SF-0i, SF-[N/2]i, ..., SF-[(K−1)N/2]i.
1 Referring to Figure 5.1, the position of a sub-flow is indicated by the number of packets offset from the beginning of the original full-flow.
6.2. MY PROPOSAL 131
Figure 6.1: An illustration of the sub-flow identification step
Depending on the choice of S, sub-flows may overlap. N is chosen to reflect the lower bound on classification timeliness (as discussed in Section 5.5). The choice of S influences the number of sub-flow instances identified, which subsequently affects the processing overhead for sub-flow selection in Step 2. Flow length may vary, which results in a variation in the number of sub-flows identified per full-flow and in the number of instances for each sub-flow class. The typical flow length of the application should therefore be taken into account when choosing suitable N and S values. The optimisation of N and S is implementation-specific.
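The sliding procedure above can be sketched as follows. This is a minimal illustration under assumed inputs (a flow represented as a plain list of packets), not the implementation used in this thesis:

```python
def identify_subflows(packets, n, s):
    """Slide a window of n packets across a full-flow in steps of s packets,
    returning (offset, window) pairs; only complete windows of n packets count."""
    subflows = []
    offset = 0
    while offset + n <= len(packets):
        subflows.append((offset, packets[offset:offset + n]))
        offset += s
    return subflows

# A hypothetical 100-packet flow with N = 25 and S = N/2 (12 packets):
flow = list(range(100))
subs = identify_subflows(flow, n=25, s=25 // 2)
offsets = [off for off, _ in subs]  # 0, 12, 24, ..., 72
```

Smaller S gives denser coverage of the flow's phases, at the cost of more sub-flow instances to process in Step 2.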
6.2.2 Step 2 - Sub-flow selection
Selection of sub-flows is automated through the use of clustering ML techniques. This novel
approach is motivated by the analysis of ET traffic’s statistical properties in Section 5.3.3. The
scatter plots of Figures 5.2 and 5.3 hinted that feature values calculated for sub-flow instances
naturally form into a number of clusters. A cluster may contain members of different sub-flow classes that nevertheless share similar statistical characteristics. This suggests that the classifier may be trained using only a subset of each cluster's members instead of the whole cluster's population. Representative members for a cluster can, for example, be members of the sub-flow class that dominates the cluster's population.
The key now is to search for the clusters among the sub-flows’ instances. To search in multi-
dimensional datasets, ML clustering techniques appear to be the best tools. An unsupervised
clustering algorithm identifies ‘natural’ clusters among the initial set of sub-flows from Step 1,
from which we can then select a set of sub-flows that represent key statistical characteristics of
the application’s traffic.
Once the clusters are found, a subset of their members must be chosen to represent the
clusters in training the classifier 2. How to choose representative members for a cluster is a
choice to be made during implementation. For example, one can choose to use members of one
sub-flow class that dominates the cluster, or members of more than one sub-flow per cluster to
train the classifier. The choice again involves trade-offs between required classification model
build time, and classification speed and accuracy of the built classification model.
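One way to realise the 'dominant sub-flow class per cluster' option is a simple majority count. The sketch below is a hedged illustration only; the cluster contents are invented:

```python
from collections import Counter

def dominant_subflow_per_cluster(cluster_members):
    """For each cluster, return the sub-flow class whose instances dominate
    the cluster's population. cluster_members maps a cluster id to the list
    of sub-flow class labels of its member instances."""
    return {cid: Counter(labels).most_common(1)[0][0]
            for cid, labels in cluster_members.items()}

# Toy clusters: SF-0 dominates cluster 0, SF-50 cluster 1, SF-80 cluster 2.
clusters = {
    0: ["SF-0", "SF-0", "SF-10", "SF-20", "SF-30"],
    1: ["SF-50", "SF-50", "SF-40", "SF-60", "SF-70"],
    2: ["SF-80", "SF-80", "SF-40", "SF-90"],
}
representatives = dominant_subflow_per_cluster(clusters)
```

Training then proceeds on the instances of the selected classes only, rather than every member of every cluster.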
Figure 6.2 illustrates this concept 3.
Figure 6.2: An illustration of selecting representative sub-flows to train a classifier
In this example, Step 1 identified 10 sub-flow classes, namely SF-0, SF-10, ..., SF-90. Mem-
bers of these sub-flows form into three clusters: Cluster 0 contains members of SF-0, SF-10,
SF-20 and SF-30; Cluster 1 contains members of SF-30, SF-40, SF-50, SF-60 and SF-70; Cluster 2 contains members of SF-40, SF-80 and SF-90. These three clusters demonstrate three phases in which the application's traffic flows have noticeably distinctive statistical properties over their lifetime. Each cluster is then examined to determine which sub-flow's members contribute the majority value of the class attribute (that is, which sub-flow dominates) within the cluster. It might happen that, for this particular example, members of SF-0 dominate Cluster 0's population, members of SF-50 dominate Cluster 1's population, and SF-80's members dominate Cluster 2's population. In this case, SF-0, SF-50 and SF-80 are selected as representative sub-flows for Clusters 0, 1 and 2 respectively. In this simplified example, all members of SF-0 belong to a single cluster, as do the members of SF-50 and SF-80. Members of SF-0, SF-50 and SF-80 are then used to train the classifier.

2 Sub-flows entered into a clustering algorithm are labelled for post-analysis of sub-flow selection. The clustering process itself is unsupervised.
3 Data points given in this example are created for illustrative purposes only. They are not the actual data drawn from my ET datasets.
Using only the most dominant sub-flow class of each cluster reduces the computational load
when training the classifier. It also ensures that the sub-flows chosen cover all critical phases of
the application’s flows during their lifetime.
In practice, members of a sub-flow may belong to more than one cluster, and one sub-
flow can dominate more than one cluster. This will lead to different implementation options in
choosing the representative sub-flow members to train the classifier.
6.3 An experimental illustration of my proposal
To illustrate my proposal I use the same scenario as described in Chapter 5: a real-time Naive Bayes / C4.5 Decision Tree classifier must accurately identify Wolfenstein Enemy Territory traffic mixed in amongst unrelated, interfering traffic. The flow definition and feature set are the same as those defined in Section 5.3.1.
The EM [161] clustering algorithm, in its WEKA implementation [175], is chosen for automated sub-flow selection. The EM algorithm is described in Section 3.1.4. It has been used to cluster IP traffic flows in previous studies, such as [59] and [60].
6.3.1 Step 1 - Sub-flow identification
As in Chapter 5, I use N = 25 packets for the sliding classification window. The identification
of sub-flows in Step 1 is carried out as follows. I divided the full-flow into two phases, the
‘earlier phase’ and the ‘later phase’, and selected a number of sub-flows for each phase. Let M
be the number of packets offset from the beginning of each flow in the dataset. Sub-flows for the
‘earlier phase’ started at M = 0, increasing by steps (S) of 10 packets, until M = 90. Sub-flows
for the 'later phase' started at M = 2000 and increased by steps (S) of 1000 packets until M = 9000 4. This step resulted in 18 different sub-flows starting at different points within the
full-flow lifetime. Instances of these sub-flows are labelled (for post-clustering analysis only)
then submitted to the EM Clustering ML algorithm for the Step 2 process.
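The 18 starting positions described above can be generated directly (a sketch that simply restates the values in the text):

```python
# 'Earlier phase': M = 0, 10, ..., 90 (S = 10 packets).
earlier = list(range(0, 91, 10))
# 'Later phase': M = 2000, 3000, ..., 9000 (S = 1000 packets).
later = list(range(2000, 9001, 1000))
offsets = earlier + later  # 18 sub-flow starting positions in total
```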
Figure 6.3: Step 1 - Experimental approach
The ET trace described in Section 5.3.4 is used for the analysis of ET's statistical properties and sub-flow selection. With this dataset, the 18 sub-flows result in a total of 23,875 instances, which will be used as input to the clustering process in Step 2. Figure 6.4 presents in detail the
number of instances per sub-flow identified in this step.
6.3.2 Step 2 - Sub-flow selection
In the WEKA implementation of EM, one can either specify the desired number of clusters or leave it to the tool to determine the optimal number of clusters (see Section 3.1.4). The optimal number of clusters is the one that produces the highest estimated log-likelihood, a measure of the goodness of the clustering, which denotes the likelihood that the data originates from the clustering model, given the values of the estimated parameters.
4 With my current implementation of feature calculations and choice of clustering algorithm, using a small S value consistently in identification of sub-flows would result in an enormous processing overhead. I chose to use a simplified approach with different S values for the earlier and later phases of a flow's lifetime.
Figure 6.4: Number of instances for each sub-flow identified in Step 1
This optimal number of clusters can produce an optimal cluster model, which can subse-
quently lead to an accurate classifier. However, it may not necessarily produce an optimal
classifier, which involves the trade-offs between classification accuracy, classification speed
and computational complexity. The choice of the number of clusters, therefore, may need to be
evaluated against the performance of the corresponding classifier built on the clustering results.
Consequently, there are two options available for choosing the optimal number of clusters. I
refer to the first option as pre-classification. This optimisation process begins with one cluster,
and continues to add clusters until the estimated log-likelihood can no longer be increased. I
refer to the second option as post-classification. This technique begins with one cluster, and
continues to add clusters until there can be no more increase in the estimated performance of
the classifier trained and tested using each different number of clusters.
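The pre-classification option reduces to a simple stopping rule. The sketch below is a hedged illustration, not this thesis's implementation: `fit_and_score` is a hypothetical callable standing in for one run of the EM algorithm (WEKA's EM in my experiments) that returns the estimated log-likelihood for a given number of clusters.

```python
def choose_num_clusters(data, fit_and_score, k_max=20, min_gain=1e-3):
    """Pre-classification option: start with one cluster and keep adding
    clusters while the estimated log-likelihood still improves."""
    best_k = 1
    best_ll = fit_and_score(data, 1)
    for k in range(2, k_max + 1):
        ll = fit_and_score(data, k)
        if ll - best_ll <= min_gain:
            break  # log-likelihood can no longer be increased
        best_k, best_ll = k, ll
    return best_k, best_ll

# Hypothetical scores: the log-likelihood stops improving after 4 clusters.
scores = {1: -10.0, 2: -6.0, 3: -3.0, 4: -2.0, 5: -2.0}
k, ll = choose_num_clusters(None, lambda _data, kk: scores[kk], k_max=10)
```

The post-classification option replaces `fit_and_score` with a full train-and-test cycle of the classifier, which is why it is far more complex and computationally expensive.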
Although the post-classification option may produce the optimal classifier, the optimisation
process is dependent on the ML classification algorithm used in the classifier. This process
can also be very complex and computationally expensive. A range of factors (such as Recall,
Precision, classification speed, classification model build time, and physical resource require-
ments) need to be taken into consideration. It is still more challenging to automate the whole
optimisation process.
In this chapter, with the aim of making the clustering process simple, fully automated and
independent of the ML classification algorithm used by the classifier, I choose to use the optimal
number of clusters found by the pre-classification option. My research as outlined in Appendix
E, which evaluates the performance of the classifiers using both options, indicates that the pre-
classification option can produce a classifier that possesses high accuracy with only small trade-
offs in required classification model build time and memory usage.
To select representative sub-flows for each cluster, I use WEKA's 'classes to clusters' evaluation mode. First, this mode ignores the sub-flow label class attribute and generates the clusters. Then, during the test phase, it assigns a sub-flow label to each cluster 5.
Figure 6.5: Sub-flow to cluster mapping and evaluation.
5 In this mode, all possibilities of the class-to-cluster assignment are tried. The total number of incorrectly clustered instances, compared to the labels of the instances in the training set, called the classification error, is computed for each assignment. The class-to-cluster assignment with the smallest classification error is chosen [124]. The actual classification error, however, is not important, as it is expected that a sub-flow's instances can be classified into clusters with different sub-flow class labels.
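The footnote describes trying every class-to-cluster assignment and keeping the one with the fewest incorrectly clustered instances. A brute-force sketch of that idea follows (workable only for tiny examples; the data here is invented):

```python
from itertools import permutations

def classes_to_clusters(instances, classes):
    """Try every assignment of a distinct class label to each cluster and
    keep the one with the fewest incorrectly clustered instances.
    instances is a list of (cluster_id, true_label) pairs."""
    clusters = sorted({c for c, _ in instances})
    best_map, best_err = None, None
    for labels in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, labels))
        err = sum(1 for c, lab in instances if mapping[c] != lab)
        if best_err is None or err < best_err:
            best_map, best_err = mapping, err
    return best_map, best_err

# Toy data: cluster 0 is mostly SF-0, cluster 1 mostly SF-50.
data = [(0, "SF-0"), (0, "SF-0"), (0, "SF-10"),
        (1, "SF-50"), (1, "SF-50"), (1, "SF-0")]
mapping, error = classes_to_clusters(data, ["SF-0", "SF-10", "SF-50"])
```

As the footnote notes, the absolute error is not the point; the chosen mapping is what identifies a representative sub-flow class per cluster.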
Figure 6.5 summarises the results of the clustering process. The EM algorithm found eight clusters among the 18 sub-flows identified in Step 1, and assigned eight sub-flow classes to map to these eight clusters. I use these eight sub-flows as representative sub-flows to train my Naive Bayes and C4.5 Decision Tree classifiers. The sub-flows chosen are SF-0, SF-10, SF-20, SF-30, SF-40, SF-50, SF-60, and SF-3000, with a total of 12,804 instances (approximately half of the total instances for all 18 sub-flows found in Step 1) 6.
6.3.3 Evaluation of classifiers trained with sub-flows selected by EM
With the selected sub-flow classes in Step 2, Naive Bayes and C4.5 Decision Tree classifiers
are built and tested using the same method as described in section 5.3.4 for multiple sub-flows
classifiers. I call this an Automatically selected multiple sub-flows (MultiSFs-AutoSel) classifier.
The performance of MultiSFs-AutoSel is evaluated, and compared with other classifiers trained
by different approaches, including:
• Full-flow classifier (Full-flow): as defined in Chapter 5.
• Filtered full-flow classifier (Filtered full-flow): as defined in Chapter 5.
• Manually selected multiple sub-flows classifier (MultiSFs-ManualSel): The four sub-
flows selected to build the multiple sub-flows classification model as outlined in Chapter
5.
• All found multiple sub-flows classifier (MultiSFs-AllFound): All sub-flows identified in
Step 1 are used to train the classifier.
Different training approaches lead to differences in the number of training instances for each classifier. This is summarised in Table 6.1.
These classifiers are compared for Accuracy (based on Precision and Recall as defined in
Section 2.3) and Computational performance. Computational performance is evaluated using
the three sub-metrics:
• Model build time: The CPU time required to train a classifier.
6 This is justified as acceptable for my study because I will have a collection of subsets of all clusters to train my classifiers. This meets the requirements of Steps 1 and 2 of my proposed approach.
Table 6.1: The differences in training instances for each classifier

Classifier          | Training Instances
Full-flow           | One full-flow in the data trace results in one instance to train the classifier.
Filtered full-flow  | One full-flow (greater than 25 packets long) in the data trace results in one instance to train the classifier.
MultiSFs-ManualSel  | One full-flow in the data trace results in up to four 7 instances to train the classifier.
MultiSFs-AutoSel    | One full-flow in the data trace results in up to eight instances to train the classifier.
MultiSFs-AllFound   | One full-flow in the data trace results in up to eighteen instances to train the classifier.
• Classification speed: The number of classifications that can be performed in each CPU
second.
• Memory usage: Memory usage for building the classification model and classifying using
the built model.
In addition, I study the clustering time, which refers to the CPU time required for the clus-
tering process.
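As a reminder of how the two accuracy metrics reduce to simple counts (a minimal sketch; the labels and data below are made up, and the definitions themselves are those of Section 2.3):

```python
def precision_recall(predicted, actual, target="ET"):
    """Precision = TP / (TP + FP) and Recall = TP / (TP + FN), with the
    application of interest (here, hypothetically, 'ET') as the positive class."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == target and a == target)
    fp = sum(1 for p, a in zip(predicted, actual) if p == target and a != target)
    fn = sum(1 for p, a in zip(predicted, actual) if p != target and a == target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Four test windows: two true positives, one false positive, one false negative.
pred = ["ET", "ET", "ET", "Other"]
act = ["ET", "ET", "Other", "ET"]
p, r = precision_recall(pred, act)
```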
This experiment was run on the Swinburne supercomputer cluster (described in Section
5.3.5). The physical resource consumption (CPU time and memory usage) was tracked using
Qstat [184].
6.4 Results and analysis
Figure 6.6 shows the normalised number of instances used to train each classifier. A value
of 1 represents the highest number of instances (91,641 instances for the Full-flow classifier).
The Filtered full-flow classifier has the smallest number of instances, as one full-flow resulted in one training instance and all flows shorter than 25 packets have been filtered out, as described in Chapter 5. For the three classifiers trained on multiple sub-flows, the more sub-flows selected, the greater the number of instances used to train the classifier. The MultiSFs-AllFound model has the highest number of training instances among the three, followed in order by the MultiSFs-AutoSel and the MultiSFs-ManualSel classifiers. The number of instances used to
train a classifier directly affects the time taken to build the classification model, as revealed in
the results section below.
Figure 6.6: Normalised number of instances in training each classifier
6.4.1 Accuracy
Figure 6.7 depicts the Recall for each of the Naive Bayes classifiers for 19 positions of the
sliding window with the test dataset (detailed in Section 5.3.4). The results are presented using
boxplots 8.
Consistent with the results seen in Chapter 5, Full-flow and Filtered full-flow classifiers
result in very low Recall when classifying traffic using the sliding window. All classifiers trained
with multiple sub-flows produce greater than 98% Recall.
Figure 6.8 is a zoomed-in version of Figure 6.7, to enable a more precise comparison of
the MultiSFs-ManualSel, MultiSFs-AutoSel and MultiSFs-AllFound classifiers. Among these
three classifiers, the MultiSFs-AutoSel classifier has the highest median Recall of 99%, fol-
8 The black line in the box indicates the median; the bottom and top of the box indicate the 25th and 75th percentiles, respectively. The vertical lines drawn from the box are whiskers. The upper cap is the largest observation that is ≤ the 75th percentile + 1.5*IQR (interquartile range, essentially the length of the box). The lower cap is the smallest observation that is ≥ the 25th percentile - 1.5*IQR. Any observations beyond the caps are drawn as individual points, and indicate outliers.
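The whisker rule in the footnote can be computed directly. The sketch below uses one common quartile convention (medians of the lower and upper halves of the sorted data); plotting packages may use slightly different conventions, so it is illustrative only:

```python
import statistics

def boxplot_stats(data):
    """Median, quartiles (medians of the lower and upper halves), whisker
    caps per the 1.5*IQR rule, and outliers beyond the caps."""
    xs = sorted(data)
    half = len(xs) // 2
    q1 = statistics.median(xs[:half])                  # 25th percentile
    q3 = statistics.median(xs[half + len(xs) % 2:])    # 75th percentile
    iqr = q3 - q1
    upper_cap = max(x for x in xs if x <= q3 + 1.5 * iqr)
    lower_cap = min(x for x in xs if x >= q1 - 1.5 * iqr)
    outliers = [x for x in xs if x < lower_cap or x > upper_cap]
    return statistics.median(xs), q1, q3, lower_cap, upper_cap, outliers

# 100 lies beyond the upper cap, so it would be drawn as an outlier point.
stats = boxplot_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```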
Figure 6.7: Recall for Naive Bayes classifiers trained on various selections of full-flows and sub-flows
Figure 6.8: Recall for Naive Bayes classifiers using multiple sub-flows, expanded from Figure 6.7
lowed by the MultiSFs-AllFound classifier with a median Recall of 98.5% and the MultiSFs-
ManualSel with a median Recall of 98.3%. The differences in Recall among these three clas-
sifiers are small, at less than 1%. However, the slightly higher Recall of the MultiSFs-AutoSel
classifier compared to the MultiSFs-ManualSel classifier suggests that even without the expert
knowledge, clustering ML techniques can effectively assist the selection of sub-flows that cover
all distinct phases of the application’s flows during their lifetime. The MultiSFs-AutoSel clas-
sifier, in addition, has the smallest gap between the 25th and 75th percentile, which suggests a
better consistency in the classification’s Recall for all positions of the sliding window consid-
ered.
Figure 6.9 presents the Precision for each of the Naive Bayes classifiers. While the Full-flow
classifier shows the maximum median Precision of 100%, it is an indication of an over-fitting
problem (discussed in section 5.4). The high Precision of this classifier does not have much
meaning due to the low Recall achieved. Consistent with the results presented in Chapter 5,
the Filtered full-flow classifier has a low median Precision of 12.7%. All classifiers using the
multiple sub-flows training approach achieve higher than 91% Precision.
Figure 6.9: Precision for Naive Bayes classifiers trained on various selections of full-flows and sub-flows
Figure 6.10 is an expanded version of Figure 6.9 and focuses only on comparison of the
three classifiers trained on multiple sub-flows. Among the three classifiers, the MultiSFs-
AllFound classifier achieves the highest median Precision (94.9%), followed by the MultiSFs-
AutoSel (93.3%) and the MultiSFs-ManualSel (91.9%) classifiers. The MultiSFs-AutoSel clas-
sifier achieves slightly higher results in both Precision and Recall than the MultiSFs-ManualSel
classifier, which suggests that the automated sub-flows selection approach positively assists
in building a more accurate classifier. Using all found sub-flows results in lower Recall and
higher Precision of the MultiSFs-AllFound classifier compared to the MultiSFs-AutoSel classi-
fier, suggesting the possibility of over-fitting (similar to the case illustrated in Figure 5.16(a)).
The trade-off between Precision and Recall with regard to the number of clusters (hence the
number of sub-flows to train a classifier) is presented in Appendix E.
Figure 6.10: Precision for Naive Bayes classifiers using multiple sub-flows, expanded from Figure 6.9
Figure 6.11 shows the Recall for each C4.5 Decision Tree classifier. Consistent with the
results seen in Chapter 5, Full-flow and Filtered full-flow classifiers result in very low Recall
when classifying traffic using the sliding window. All classifiers trained on multiple sub-flows
produce greater than 98% median Recall.
The differences among the three classifiers trained on multiple sub-flows are small, at less
than 0.5%. The MultiSFs-AllFound classifier has the highest median Recall of 98.9%, followed
Figure 6.11: Recall for C4.5 Decision Tree classifiers trained on various selections of full-flows and sub-flows
by the MultiSFs-AutoSel (98.7%) and the MultiSFs-ManualSel (98.4%) classifiers.
Figure 6.12 summarises the Precision for each of the C4.5 Decision Tree classifiers. Con-
sistent with the results shown in Chapter 5, the Full-flow and Filtered full-flow classifiers have
low median Precision at less than 62%. All classifiers trained with multiple sub-flows achieve
a median Precision of higher than 97%.
Among the three classifiers trained on multiple sub-flows, the MultiSFs-ManualSel and
MultiSFs-AutoSel classifiers achieve an almost identical median Precision of 97.8%. The
MultiSFs-AllFound classifier achieves a slightly lower Precision, with a median of 97.5%.
However, the differences in Precision achieved by all three classifiers trained on multiple sub-
flows are less than 0.5%. The three classifiers achieved similar levels of consistency in Recall
and Precision with the 19 positions of the sliding window tested.
To sum up, my results indicate that manual selection of sub-flows for training is not necessary in the general case. My datasets even demonstrate that slightly better Precision and
Recall can be achieved using the clustering technique for automated sub-flow selection. Using
all sub-flows identified results in similar Precision and Recall to using only sub-flows automat-
ically selected by the EM algorithm. Also the C4.5 Decision Tree classifiers achieved higher
Figure 6.12: Precision for C4.5 Decision Tree classifiers trained on various selections of full-flows and sub-flows
Precision and Recall than the Naive Bayes classifiers across all tests.
6.4.2 Computational performance
This section compares the classifiers in terms of computational performance. This evaluation
is important considering real-time classification of potentially thousands of simultaneous traffic
flows. Each experiment was repeated three times with the median of all three runs taken to
represent each experiment.
Figure 6.13 compares the normalised build time for each of the Naive Bayes classifiers. A
value of 1 represents the slowest build time (214.47 seconds in the supercomputer environment
described earlier).
As expected, the larger the number of training instances, the longer the time required to
construct a classification model. As shown in Figure 6.13, the Full-flow classifier has the longest
required model build time as it contains the largest number of training instances. The Filtered
full-flow classifier has the shortest required model build time. The MultiSFs-AutoSel classifier
has shorter required model build time than the MultiSFs-AllFound classifier (∼15% less), and
slightly longer model build time than the MultiSFs-ManualSel classifier (∼9% more).
Figure 6.14 depicts the normalised classification speed for the classifiers. A value of 1
Figure 6.13: Normalised build time for Naive Bayes classifiers
represents the fastest classification speed (4,051 classifications per second). The Full-flow clas-
sifier has the fastest classification speed, followed by the Filtered full-flow classifier which is
∼5% slower. Among the three classifiers trained on multiple sub-flows, the MultiSFs-AutoSel
classifier achieves the highest classification speed, ∼2% and 4% higher than the speed of the
MultiSFs-AllFound and MultiSFs-ManualSel classifiers, respectively.
Figure 6.15 outlines the normalised memory usage for the Naive Bayes classifiers while
performing 10-times cross-validation [109] of their training dataset. A value of 1 represents
the most memory consumption (552MB). Although all classifiers consume quite low mem-
ory resources, the Full-flow classifier consumes the most resources, followed by the MultiSFs-
AllFound classifier. The MultiSFs-AutoSel classifier is in the middle range of memory usage
compared to the other classifiers (∼15% less than the MultiSFs-AllFound and comparable to
the MultiSFs-ManualSel classifiers). It seems that memory usage is proportional to the required
model build time. The longer the time taken to build the model, the greater the memory usage.
Figure 6.16 compares the normalised build time for each of the C4.5 Decision Tree clas-
sifiers. A value of 1 represents the slowest build time (450.87 seconds on our test platform).
Similar to the results of the Naive Bayes classifiers, the Full-flow classifier has the longest model
build time. The Filtered full-flow classifier has the shortest model build time. The MultiSFs-
Figure 6.14: Normalised classification speed for Naive Bayes classifiers
Figure 6.15: Normalised memory usage for Naive Bayes classifiers while performing 10-times cross-validation (during both training and testing)
AutoSel classifier's required build time is 29% less than that of the MultiSFs-AllFound classifier, and ∼15% longer than the required build time for the MultiSFs-ManualSel classifier.
Figure 6.16: Normalised build time for C4.5 Decision Tree classifiers
Figure 6.17 presents the normalised classification speed for the C4.5 Decision Tree classi-
fiers. A value of 1 represents the fastest classification speed (15,402 classifications per second).
The Full-flow classifier has the fastest classification speed, followed by the Filtered full-flow
classifier which is ∼7% slower. Among the three classifiers trained on multiple sub-flows, the
MultiSFs-AutoSel classifier is ∼3% slower than the MultiSFs-ManualSel classifier, and ∼10% faster than the MultiSFs-AllFound classifier.
Figure 6.18 shows the normalised memory usage for the classifiers while performing 10-times cross-validation [109] of their training dataset. A value of 1 represents the most memory
consumption (128MB). Although all classifiers consume relatively low memory resources, the
Full-flow classifier consumes the most resources, followed by the MultiSFs-AllFound classifier.
The MultiSFs-AutoSel classifier consumes ∼15% less than the MultiSFs-AllFound classifier,
and ∼10% more than the MultiSFs-ManualSel classifier.
6.4.3 Summary of results
These results suggest that in general, training on multiple sub-flows is significantly more effec-
tive than the traditional full-flow training approach in terms of Precision and Recall, required
Figure 6.17: Normalised classification speed for C4.5 Decision Tree classifiers
Figure 6.18: Normalised memory usage for C4.5 Decision Tree classifiers while performing 10-times cross-validation
model build time and physical resources usage, with a slight trade-off in terms of classification
speed.
For the comparison among the three classifiers trained on multiple sub-flows, Figure 6.19
summarises their median Precision and Recall results.
Figure 6.19: Summary of Precision / Recall results for Naive Bayes (NB) and C4.5 Decision Tree (DT) classifiers trained on multiple sub-flows
The results of training the Naive Bayes classifiers on the sub-flows selected automatically
by the EM algorithm include:
• Highest Recall (0.5% and 0.7% higher than the MultiSFs-AllFound and MultiSFs-ManualSel
classifiers). The Precision is 1.4% higher than when the classifier is trained on manually
selected sub-flows and 1.6% lower than the Precision of the classifier trained on all sub-
flows identified in Step 1.
• A reduction in the required classification model build time by ∼15%, an improvement in the classification speed by ∼2%, and consumption of ∼15% less memory, compared to training on all sub-flows identified in Step 1.
• A faster classification speed (by ∼4%), a longer required model build time (by ∼9%), and memory usage similar to the classifier trained on manually selected sub-flows.
The results of training the C4.5 Decision Tree classifiers on the sub-flows selected automatically
by the EM algorithm include:
Figure 6.20: Summary of computational performance results for Naive Bayes (NB) and C4.5 Decision Tree (DT) classifiers trained on multiple sub-flows: (a) normalised model build time; (b) normalised memory usage while performing 10-times cross-validation; (c) normalised classification speed
6.5. SAMPLING FOR FASTER CLUSTERING 151
• A slightly higher Recall (0.3%) than the MultiSFs-ManualSel classifier, and a slightly
lower Recall (0.2%) than the MultiSFs-AllFound classifier. The median Precision is sim-
ilar to when the classifier is trained on manually selected sub-flows and 0.3% higher than
the Precision of the classifier trained on all sub-flows identified in Step 1.
• A reduction in the required classification model build time (by ∼29%), an improvement in the classification speed (by ∼10%), and consumption of ∼15% less memory, compared to training on all sub-flows identified in Step 1.
• A slower classification speed (by ∼3%), a longer required model build time (by ∼15%) and consumption of ∼10% more memory, compared to when the classifier is trained on manually selected sub-flows.
The results suggest that we can automatically train an effective classifier without requiring expert knowledge of the application of interest. Clustering ML techniques offer a distinct advantage in building a fast classifier for real-time classification with high Precision and Recall. Using all sub-flows found in Step 1 can also create accurate classifiers with high Precision and Recall; however, this requires a longer model training time, and the resulting classifiers are slower in classification speed, which could become an issue when scaling to multiple concurrent application classification. From my results, while the C4.5 Decision Tree classifiers take a longer time to build, they are much faster (nearly three times) than the Naive Bayes classifiers and have higher Precision and Recall overall. This is consistent with the previous findings of [171].
6.5 Sampling for faster clustering
6.5.1 The problem
One limitation of my current experimental approach is the slow clustering time using the EM algorithm. With the sub-flows identified in Step 1, the clustering process took up to 172 CPU hours to complete in the supercomputer environment described earlier. Although this step can be carried out offline, it should be improved so that it does not outweigh the gain in the required classification model build time discussed above.
This can be improved in a number of ways:
• Using a smaller number of iterations when running the EM algorithm.
• Using another ML clustering algorithm.
• Using a more powerful processing unit.
• Down-sampling the number of instances for the clustering process.
Each of these solutions involves trade-offs between the reduction in processing overhead (especially CPU time), the cost of the processing unit, and the quality of the clusters produced.
6.5.2 Down-sampling for the clustering proposal
In this section I investigate the method of down-sampling the dataset for clustering. My pro-
posed solution is to sample only a small number of instances from each sub-flow class identified
from Step 1 to use as input to the clustering process in Step 2. With the aim of understanding the
statistical properties of an application’s traffic, small samples of flow instances may be sufficient
to give us valuable hints for representative sub-flows.
For each sub-flow identified in Step 1, I randomly sampled 25, 50 and 100 instances and used them as input for Step 2's clustering process (compared to more than 1,000 instances per sub-flow for the full dataset, as presented in Figure 6.4). I measured the time taken for the clustering
process to complete for each case, and compared them with the time taken when all sub-flows’
members were used. The clusters produced in each case were evaluated using the same method
as described earlier in this chapter.
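A minimal Python sketch of this sampling step follows. The data layout (a mapping from sub-flow identifiers to lists of feature vectors) and the function name are my own illustrative assumptions, not the implementation used in my experiments; the sampled subset is what would then be handed to the expensive EM clustering step.

```python
import random

def downsample_per_subflow(subflow_instances, k, seed=0):
    """Keep at most k randomly chosen instances from each sub-flow class.

    subflow_instances maps a sub-flow identifier to its list of feature
    vectors. Only the sampled subset is passed on to the (expensive)
    EM clustering step, in place of the full >1,000 instances per sub-flow.
    """
    rng = random.Random(seed)  # fixed seed for repeatable experiments
    return {sf: rng.sample(insts, min(k, len(insts)))
            for sf, insts in subflow_instances.items()}

# Example: one large and one small sub-flow class, sampled at k = 50.
data = {"sf0": list(range(1000)), "sf1": list(range(30))}
sampled = downsample_per_subflow(data, 50)
```

Note that classes smaller than k are kept whole, so the method never discards information from already-small sub-flows.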
6.5.3 Results and analysis
Figure 6.21 depicts the number of instances used for clustering, and the normalised clustering
time for different sample sizes. A value of 1 represents the longest time (172 hours). The
clustering time is proportional to the number of instances used for the clustering process. As
shown in this figure, down-sampling the number of instances for each sub-flow significantly
reduces the CPU time required for clustering. Using 100 samples per sub-flow only took 0.64%
of the clustering time when using all instances per sub-flow. The differences in clustering time when using 25, 50 and 100 samples per sub-flow are insignificant in this experiment (at less than 0.1%).
Figure 6.21: Sampled clustering: (a) normalised number of instances used for the clustering process; (b) normalised clustering time, for 25, 50 and 100 instances sampled per sub-flow and for all instances. [Figure omitted: values are shown on a normalised 0–1 scale.]
Table 6.2 presents the results of the clustering process in terms of number of sub-flows
selected.
Table 6.2: Number of sub-flows selected automatically by the clustering process

    Number of instances sampled per sub-flow:   25   50   100   All
    Number of clusters produced:                 7    9     8     8
Based on the resultant clusters, four Naive Bayes and C4.5 Decision Tree classifiers were built and compared. Figure 6.22 shows Precision and Recall results for the Naive Bayes classifiers. There are trade-offs in terms of Precision and Recall achieved. Sub-flows selected when sampling 25 instances per sub-flow produce a classifier with the lowest median Recall of 92%. Sub-flows selected when sampling ≥ 50 instances per sub-flow produce classification models with better Recall, with a median of greater than 95%. Interestingly, the experimental results reveal that using 50 instances per sub-flow seems to produce a classifier with the best combination of Precision and Recall, slightly better than using 100 instances per sub-flow. However, the difference is less than 0.3%. Although finding an optimal number of samples is left to future research, my results can be taken as an indication that we only need a small number of samples for the clustering process to produce a good sub-flow selection. This assists in markedly reducing the CPU time required for the sub-flow selection step.
Figure 6.23 shows the Precision and Recall results for the C4.5 Decision Tree classifiers. Similar to the results seen with the Naive Bayes classifiers, sub-flows selected when sampling 25 instances per sub-flow produce a classifier with lower Recall compared to no sampling. However, sampling 50 or 100 instances per sub-flow produces classification models with Recall and Precision as good as in the case of no sampling.

Since the numbers of clusters (and representative sub-flows) identified by the EM algorithm for different sampling rates are similar, the differences in model build time and classification speed for the classifiers are small. The results are presented in Figures 6.24 and 6.25.
In summary, the results of this section demonstrate that we can reduce the time taken for
the clustering process significantly by using a small number of instances without noticeably
compromising the classifier’s performance.
Figure 6.22: Precision and Recall for Naive Bayes classifiers using sub-flows selected by EM with small numbers of samples for the clustering process: (a) Recall (%); (b) Precision (%), for 25, 50 and 100 instances sampled per sub-flow and for all instances.
Figure 6.23: Results for C4.5 Decision Tree classifiers using sub-flows selected by EM with small numbers of samples for the clustering process: (a) Recall (%); (b) Precision (%), for 25, 50 and 100 instances sampled per sub-flow and for all instances.
Figure 6.24: Normalised model build time for Naive Bayes and C4.5 Decision Tree classifiers trained on sub-flows selected by EM with small numbers of samples used in the clustering process (25, 50 and 100 samples per sub-flow, and all sub-flow instances).
Figure 6.25: Normalised classification speed for Naive Bayes and C4.5 Decision Tree classifiers trained on sub-flows selected by EM with small numbers of samples used in the clustering process (25, 50 and 100 samples per sub-flow, and all sub-flow instances).
6.6 Discussion and future work
There are a number of limitations in my current experimental approach. Further improvement
can be gained in the following areas:
• This chapter studies sub-flow selection for ET traffic. The sub-flows used to train the Other class were not optimised. One reason for this is that most of the Other class's example flows are short. Thus I chose only two sub-flows, taken at the beginning and middle of the original full-flows as described in section 5.3. Further investigation into which sub-flows best train the classifier for the Other class may lead to a better result.
• In the evaluation metric, I have not considered the prior processing overhead, which is the
processing for the preparation of datasets to train a classifier. For each of the classifiers
listed above the prior processing overhead includes:
– Full-flow classifier: Features calculation for all full-flow instances in the data trace.
– Filtered full-flow classifier: Processing for the removal of flows shorter than the size
of the sliding window and features calculation for all longer full-flow instances in
the data trace.
– MultiSFs-ManualSel classifier: Study and analysis to understand the statistical char-
acteristics of the application’s traffic; features calculation for a selected number of
sub-flows; searching for the best combination of sub-flows to produce the classifier.
– MultiSFs-AutoSel: Features calculation for all sub-flows identified in Step 1 and in
the clustering process in Step 2.
– MultiSFs-AllFound: Features calculation for all sub-flows identified in Step 1.
Some components may not be precisely measurable (such as the study and analysis to understand the statistical characteristics of the application's traffic, which is normally performed by a domain expert) or may be too dependent on the choice of implementation (such as a recursive search for the best combination of sub-flows to train a classifier).
• Feature calculation takes a finite period of time, depending on the complexity of features
used by a particular ML-based IPTC system. In my experiments, the feature sets are
simple, and statistics can be computed incrementally when a packet arrives in the sliding
window. I consider the feature calculation time to be relatively small compared to the
total time taken to collect N packets for the classification.9 Consequently my focus
in this chapter has been on overall classification speed. More detailed analysis of the
computational load of alternative features is a topic for future work.
• More sub-flow positions could be chosen in Step 1 in the experiments to make the com-
parison between MultiSFs-AutoSel and MultiSFs-AllFound classifiers clearer.
• The test dataset is constructed with a static selection of the sliding window. Hence the
stability and consistency of the classification result is limited to the selected positions of
the sliding window tested. Testing the classifier models in a live network would be ideal.
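The incremental feature computation mentioned in the bullet on feature calculation can be maintained in constant time per arriving packet. A minimal sketch follows; the window size and the mean-packet-length feature are arbitrary illustrative choices, not my experimental configuration.

```python
from collections import deque

class SlidingWindowMean:
    """Incrementally maintain the mean packet length over the last n packets.

    Each call to add() is O(1): the running total is adjusted by the
    evicted and arriving packet lengths, rather than recomputed.
    """
    def __init__(self, n):
        self.window = deque(maxlen=n)
        self.total = 0.0

    def add(self, pkt_len):
        if len(self.window) == self.window.maxlen:
            # deque drops the oldest element on append; subtract it first
            self.total -= self.window[0]
        self.window.append(pkt_len)
        self.total += pkt_len
        return self.total / len(self.window)

w = SlidingWindowMean(3)
for length in (1, 2, 3, 5):
    mean = w.add(length)
```

The same pattern extends to the other simple statistics (e.g. inter-arrival times) by keeping one running total per feature.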
In future work, ML clustering techniques can be used to recognise new and unknown appli-
cations as described below.
One well-known problem related to using supervised ML techniques for classification is the
inability to detect new and unknown applications. For example, suppose a classifier uses supervised ML techniques to identify ET against Other applications, where the Other class is trained with the traffic of known applications such as FTP, Kazaa, Email and Web. The classifier performs very well until a new and unknown application, such as VoIP traffic, is introduced into the network.
Since the classifier has not yet been updated with the newly emerged application, it will classify
some of the VoIP traffic as ET, and some of the VoIP as Other traffic. The classifier’s Precision
will degrade as a result.
An ML clustering technique may offer a solution. A small sample of the classifier’s output
can be used to keep track of the historical profile of the traffic’s statistics and its variation trends,
using ML clustering techniques. When a significant change in the application’s statistical pro-
file is detected, we know the classifier should be updated. For example, when the clustering
9 With a sliding window of 25 packets, it usually takes less than 0.5 seconds to collect enough packets for a classification. Most processors handle millions of instructions per second, so calculating simple mathematical features in microseconds represents a trivial fraction of the typical arrival time between packets making up the sliding window.
technique detects a new cluster, the new cluster will be examined and traced back to its source
application. Its traffic then will be collected to re-train and update the classifier. Figure 6.26
illustrates the idea.
My results presented in section 6.5 suggest that only a small sample of the classifier's output traffic is needed for this purpose.
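The update trigger sketched above can be reduced to a comparison between the historical and current cluster sets. In the sketch below, clusters are represented simply by labels; how clusters found at different times are matched in practice is an open implementation question and the function name is my own.

```python
def detect_new_clusters(baseline, current):
    """Return clusters present in the current clustering of the sampled
    classifier output but absent from the historical baseline.

    A non-empty result signals a significant change in the traffic's
    statistical profile: the new cluster should be traced back to its
    source application and the classifier re-trained.
    """
    return sorted(set(current) - set(baseline))

# Example: a VoIP cluster appears that the baseline has never seen.
alerts = detect_new_clusters({"ET", "Web"}, {"ET", "Web", "VoIP"})
```

A real deployment would re-cluster periodically on the sampled output and raise an update event whenever the returned list is non-empty.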
6.7 Conclusion
In this chapter I extend my work on training with multiple sub-flows, presented in Chapter 5, to include the idea of using clustering ML techniques for automated sub-flow selection. This extension is significant for the deployment of the proposed approach to classify new applications of interest. It eliminates the need for expert knowledge of the application and removes the complexity of manually choosing the best combination of sub-flows to train the classifier. I have presented a performance comparison between my approach, traditional full-flow training, the use of all identified sub-flows without a selection method, and the use of sub-flows selected manually.
The results suggest that my proposed approach has the potential to select the optimal num-
ber of representative sub-flows for training, which takes into account the trade-offs between
accuracy and computational performance. One limitation of my approach is the long time taken
in the clustering process using the EM algorithm. I have proposed and evaluated an approach
to overcome this limitation by down-sampling the training instances for the clustering process.
The application of my proposed method for other Internet applications and the trade-offs
in selecting parameters such as the classification window size (N) and forwarding step (S) are
subjects for future work.
Figure 6.26: An illustration of updating a classifier when new, previously unknown traffic is detected. [Diagram: labelled 'Game' and 'Other' (Web, P2P, SSH, SMTP) traffic pass through feature calculation, optional data sampling and feature filtering/selection, and ML training to produce a 'Game or Other' classification model. Sampled classifier output is clustered; detection of significant changes in the clustering results (e.g. a new cluster for new and unknown traffic such as VoIP) triggers investigation and a model update, for example by adding the new and unknown traffic to the training dataset for the Other class to re-train the classifier.]
Chapter 7
Training Using Synthetic Sub-Flow Pairs
7.1 Introduction
In Chapter 5 I presented a novel technique to train a classifier on a combination of short sub-
flows, such that IP flows can be classified in a finite period of time, starting at an arbitrary point
in a flow’s lifetime. In Chapter 6 I proposed and demonstrated an automated approach based
on the use of clustering ML techniques to choose appropriate, representative sub-flows, from
which a classifier may be trained. In this chapter, I present an improvement to the training phase
so that subsequent flow classifications need not rely on prior knowledge of inferred or actual
directionality of a flow.
The directional neutrality issue was identified in section 3.3.2 and discussed in sections 4.6.2 and 5.5. Classifiers that rely on bi-directional statistics must make an explicit assumption about the direction of each captured packet in order to calculate feature values. This becomes a challenge when classifying in an operational network, where such assumptions about the traffic direction can be wrong.
In this chapter, I propose and evaluate a novel approach for direction-neutral classification.
I train the ML classifier using:
• multiple short sub-flows’ instances extracted from the full-flows generated by the target
application. The feature values of these instances are calculated with the forward direction
defined as the client-to-server direction;
• and their mirror-image replicas, as if the flows were in the reverse direction, that is,
features of the multiple short sub-flow instances are transposed and replicated to construct
a synthetic ‘pair’ of features.
The combination of a sub-flow instance and its mirror-image replica is called a synthetic sub-flow pair (SSP). In classification, the forward direction is defined as the direction of the first packet captured in the sliding window, regardless of whether it is from client to server or server to client. This helps the classifier identify traffic flows in either direction.
For example, consider a classifier trained with the following simple scenario: a flow whose
first packet is destined for port 25 is an SMTP flow. However, imagine if the classifier misses
the first packet of the SMTP flow, instead capturing a later packet (the reply from server to
client). This packet has the source port of 25 instead of the destination port. The classifier then
classifies the flow as non-SMTP, when in fact it is an SMTP flow. To overcome the problem, the classifier should be trained with the rule that a flow whose first packet is destined for or originates from port 25 is an SMTP flow. The classifier would then not miss the SMTP flow.
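The direction-neutral rule in this example can be written in a few lines; the packet representation and function name are illustrative assumptions, not part of my classifier.

```python
def is_smtp(first_pkt):
    """Direction-neutral toy rule: a flow is SMTP if the first captured
    packet has port 25 as either its source or destination port."""
    return 25 in (first_pkt["src_port"], first_pkt["dst_port"])

# Matches both the client-to-server packet and the server's reply.
client_to_server = {"src_port": 49152, "dst_port": 25}
server_to_client = {"src_port": 25, "dst_port": 49152}
```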
I demonstrate my optimisation when applied to the Naive Bayes and C4.5 Decision Tree
classifiers, and show that the SSP approach results in good performance even when classifica-
tion is initiated mid-way through a flow, without prior knowledge of the flow’s direction.
In the next section, I present the SSP approach. Section 7.3 presents the details of my ex-
perimental approach, section 7.4 analyses the experimental results, followed by the conclusion
in section 7.5.
7.2 Proposal using a synthetic sub-flow pairs approach
Training on mirror-image replicas of each sub-flow is an important augmentation of the technique demonstrated in Chapter 5. Figure 3.5's key steps of feature calculation (F), training (T) and classification (C) are illustrated in Figure 7.1. In step F, features are calculated as described in Section 5.3.1. Each sub-flow instance is then labelled as either the ET or Other class to train the ML classifier. The output of the training step is the set of classification rules used to identify ET and Other traffic in future.
With the SSP approach, the dataset of mirror-image replicas of the sub-flows is created artificially in a separate step called F'. From the feature calculation in Step F, the mirror-image
Figure 7.1: Steps in training an ML classifier for identification of ET traffic versus Other traffic, without using the SSP approach. [Diagram: ET traffic and Other traffic each pass through feature calculation (F) and are labelled as the ET and Other classes; the training step (T) produces ET rules and Other rules, which the classification step (C) applies to output ET or Other.]
replica of a sub-flow instance is created by swapping its feature values in the client-to-server
and server-to-client (forward and backward) directions.
Figure 7.2 presents an example to illustrate how a mirror-image replica is created for a sub-
flow instance X. Consider LF and LB to be the mean forward and backward packet lengths
(respectively) of sub-flow instance X. The mirror-image replica of sub-flow instance X is as-
signed mean forward and backward packet lengths of LB and LF respectively. The same trans-
position (mirroring) step is repeated for other features of sub-flow instance X.
Figure 7.2: An illustration of how to create a mirror-image replica for a sub-flow instance. [Sub-flow instance X has feature values (LF, LB, ..., IF, IB), where LF/LB are the mean forward/backward packet lengths and IF/IB the mean forward/backward packet inter-arrival times; its mirror-image replica has the transposed values (LB, LF, ..., IB, IF).]
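The transposition step can be sketched as follows. The `fwd_`/`bwd_` naming convention is my own; the actual feature set is the one defined in section 5.3.1.

```python
def mirror_replica(instance):
    """Create the mirror-image replica of a sub-flow instance by swapping
    each forward-direction feature value with its backward counterpart.

    The original instance is left untouched; a new dict is returned.
    """
    replica = dict(instance)
    for fwd_key in [k for k in instance if k.startswith("fwd_")]:
        bwd_key = "bwd_" + fwd_key[len("fwd_"):]
        replica[fwd_key] = instance[bwd_key]
        replica[bwd_key] = instance[fwd_key]
    return replica

# Example sub-flow instance (values are arbitrary).
x = {"fwd_mean_pkt_len": 420.0, "bwd_mean_pkt_len": 120.0,
     "fwd_mean_iat": 0.035, "bwd_mean_iat": 0.050}
x_mirror = mirror_replica(x)
```

The same swap is applied uniformly to every forward/backward feature pair, which is all the mirroring step requires.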
The sub-flows’ instances and their mirror-image replicas are then labelled to train a classi-
fier. There are two options for building a classifier using SSP.
In Option 1, both sub-flows’ instances and their mirror-image replicas are labelled as one
Figure 7.3: Option 1: Both sub-flows' instances and the mirror-image replicas of every short sub-flow are labelled as one class. The classifier is trained with two classes: ET and Other. [Diagram: for each of ET traffic and Other traffic, the outputs of feature calculation (F) and mirror-image replication (F') are merged by the OR function before the training step T; the classification step C outputs ET or Other.]
class. For example, ET instances and their mirror-image replicas are both labelled as ET class.
This option trains the classifier such that a new flow which has traffic characteristics similar
to either ET or its mirror-image replica will be classified as ET traffic. The OR function, as
indicated in Figure 7.3, is placed before the training T step.
In Option 2, sub-flows’ instances and their mirror-image replicas are labelled independently
as two separate classes. For example ET instances are labelled as ET class, and their mirror-
image replicas are labelled as ET’ class. This option trains the classifier to identify ET, ET’,
Other and Other’ classes separately. Then a new flow which is classified as ET OR ET’ will
be classified as ET traffic. The OR function, as indicated in Figure 7.4, is placed after the
classifying C step.
Figure 7.5 provides an example of datasets used to train a classifier for ET traffic using
Option 1 and Option 2.
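The two labelling options, and Option 2's OR function applied after classification, can be sketched as below. Each pair holds a sub-flow instance and its precomputed mirror-image replica; the function names and the primed-label convention (`"ET'"`) are my own shorthand for the scheme described above.

```python
def label_ssp_pairs(pairs, cls, option):
    """Label (instance, mirror_replica) pairs for SSP training.

    Option 1: the replica shares the original class label (e.g. 'ET').
    Option 2: the replica gets a primed label (e.g. "ET'").
    Returns a list of (feature_vector, label) training rows.
    """
    rows = []
    for inst, replica in pairs:
        rows.append((inst, cls))
        rows.append((replica, cls if option == 1 else cls + "'"))
    return rows

def merge_prediction(label):
    """Option 2's OR function: map ET/ET' to ET and Other/Other' to Other."""
    return label[:-1] if label.endswith("'") else label

# One ET sub-flow instance and its mirror replica, labelled under Option 2.
pairs = [((420.0, 120.0), (120.0, 420.0))]
training_rows = label_ssp_pairs(pairs, "ET", 2)
```

Under Option 1 the OR function is a no-op at classification time, since replicas were merged into the original class before training.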
Figure 7.6 presents an example to illustrate the proposal. Feature values for sub-flow instances form a cluster of pink squares. Feature values for these sub-flow instances' mirror-image replicas form a cluster of blue circles. For applications with asymmetric statistics in the forward and backward directions, these clusters are mostly disjoint. Training on multiple sub-
Figure 7.4: Option 2: Sub-flows' instances and their mirror-image replicas are labelled independently as two separate classes. The classifier is trained with four classes: ET, ET', Other and Other'. [Diagram: F and F' outputs for ET and Other traffic are labelled as ET, ET', Other and Other' respectively; the OR function placed after the classification step C maps ET or ET' to ET, and Other or Other' to Other.]
flows left out many members of the sub-flows' mirror-image replicas (outliers to the multiple sub-flows cluster). Training on SSP ensures that these members are included in constructing the classification model. The classifier's Recall can therefore be improved, as the classifier does not need to make an assumption about the direction of the first packet captured in the sliding window.
On the other hand, the inclusion of these members creates an unwanted area, which is the
gap between the contributing clusters (indicated by the grey area in the figure), compared to
training without SSP. Depending on the internal construction of an ML classification algorithm,
and the method of implementation of SSP (i.e. whether Option 1 or Option 2) this area may
have different impact on the classifier’s Precision.
In Option 1, the synthetic sub-flow pairs share the same class in training the classifier. As a result, the grey area is included in training the classifier, which can create opportunities for false positives and thus lower Precision. Using Option 2, a classifier is trained with multiple
sub-flows instances and their mirror-image replicas separately. This means that the classifier is
trained to recognise members of the pink squares and blue circles clusters without the need to
include the grey area in the training phase. This may have positive impacts on Precision of the
Figure 7.5: Example datasets used to train a classifier using Option 1 and Option 2. [(a) Option 1: each sub-flow instance Xi, with feature values (LFi, LBi, ..., IFi, IBi), and its mirror-image replica, with values (LBi, LFi, ..., IBi, IFi), are both labelled as the ET class. (b) Option 2: the sub-flow instances are labelled as the ET class and their mirror-image replicas as the ET' class.]
Figure 7.6: An illustration of creating an SSP classifier from sub-flow instances and their mirror-image replicas. [Diagram: the sub-flow instances and their mirror-image replicas form two mostly disjoint clusters; the gap between them is the unwanted area. Data points are artificially created for illustration purposes only.]
classifier.
However, the classifier built with Option 1 is simpler, entailing a two-class classification. Option 2, with a four-class classification, requires more processing complexity to train the classifier. In the latter, the classification rules could be more complicated (for example, involving a much larger tree size for the C4.5 Decision Tree classifier), leading to slower classification in real-time.
The following sections present the results of my study on Naive Bayes and C4.5 Decision
Tree classifiers trained without SSP, and with SSP using Option 1 and Option 2.
7.3 Illustrating the Synthetic Sub-Flow Pairs Training Approach
To illustrate my proposal I use the same scenario as described in Chapter 5: real-time Naive Bayes and C4.5 Decision Tree classifiers must accurately identify asymmetric Wolfenstein Enemy Territory traffic mixed in among unrelated, interfering traffic. The same training and testing datasets and feature set as in Chapter 5 are used.
7.3.1 Experimental data
As seen in Chapter 5 (section 5.3.3), ET traffic characteristics are noticeably asymmetric. Mea-
sured across all the ET flows in the test dataset, Figure 7.7 shows the percentage of sub-flows
whose first packet is in the client-to-server direction as a function of M – the number of packets
offset from the start of the full-flow.1 Not surprisingly, this is 100% when M = 0, and fluctuates
significantly for 1 ≤M ≤ 9 (the value does not reach 0% for M = 1 because for ∼35% of ET
flows, both the first and second packets seen on the wire are in the client-to-server direction).
This fluctuation is expected as this region is the Probing phase where the client is discoverying
the server. In the region 2000≤M ≤ 2009 (assumed to be the In-game phase) it is more stable.
There appears to be a ∼60:40 chance that the 2001st , 2002nd and ... 2009th packets traverse in
the client-to-server or server-to-client directions. This is consistent with my analysis of the data
trace in Chapter 5, where during ET game-play we see roughly 28 PPS from client to server and
20 PPS from server to client.
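The measurement behind Figure 7.7 can be sketched as follows. The flow representation is my own assumption: each flow is a sequence of per-packet directions, +1 for client-to-server and -1 for server-to-client.

```python
def pct_client_to_server(flows, offsets):
    """For each offset M, return the percentage of flows whose (M+1)-th
    packet travels in the client-to-server direction.

    flows:   list of flows, each a list of +1 (c2s) / -1 (s2c) markers.
    offsets: the M values to evaluate (e.g. 0..9 and 2000..2009).
    """
    result = {}
    for m in offsets:
        eligible = [f for f in flows if len(f) > m]  # flows long enough
        c2s = sum(1 for f in eligible if f[m] == +1)
        result[m] = 100.0 * c2s / len(eligible)
    return result

# Two toy flows: at M = 0 both start c2s; at M = 1 only one is c2s.
flows = [[+1, -1], [+1, +1]]
percentages = pct_client_to_server(flows, [0, 1])
```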
Figure 7.7: Percentage of flows that have the first packet captured in the client-to-server direction if the first M packets are missed, for 0 ≤ M ≤ 9 and 2000 ≤ M ≤ 2009.
In general, when a classifier model is trained with an explicit definition of the direction
(client-to-server direction or the forward direction is defined as the direction of the first packet of
1 Here I chose to slide the classification window with a step of 1 packet. This is to make the alternating direction (from client to server and server to client) of the 1st packet in the sliding window clearer.
a full-flow), its Recall is dependent on the proportion of sub-flows’ instances that actually start
with the first packet traversing in the same direction (i.e. from client to server). If the first packet
of the sub-flow traverses in the opposite direction (i.e. from server to client), the classifier’s
performance will be negatively affected due to the asymmetric flow statistical properties in the
two directions. This is confirmed by my experimental results, as shown in section 7.4.
7.3.2 Test methodology
In my experiments, I study the performance of a classifier trained with an explicit definition
of flow direction that classifies in real-time. I show the classification accuracy of classifiers
trained on full-flow and multiple sub-flows that have the forward direction defined as the client-
to-server direction, when in testing the client-to-server (or forward) direction is defined as the
direction of the 1st packet captured in the sliding window, which can be from client to server or
server to client.
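Redefining the forward direction relative to the first captured packet can be sketched as below; the per-packet dictionary representation is illustrative, not my capture format.

```python
def orient_window(window):
    """Relabel packet directions so that 'forward' is the direction of the
    first packet captured in the sliding window, whatever that may be.

    Each packet is a dict with at least a 'direction' field; a new list
    of packets with 'fwd'/'bwd' directions is returned.
    """
    ref = window[0]["direction"]  # reference direction for this window
    return [dict(p, direction="fwd" if p["direction"] == ref else "bwd")
            for p in window]

# A window whose first captured packet happens to be server-to-client.
window = [{"direction": "s2c", "len": 90}, {"direction": "c2s", "len": 60}]
oriented = orient_window(window)
```

Feature calculation then proceeds on the 'fwd'/'bwd' labels, with no assumption about which endpoint is the client.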
My experimental results reveal that training with SSP using both Options 1 and 2 allows us
to achieve high Recall and Precision for both the Naive Bayes and C4.5 Decision Tree classi-
fiers. My results also confirm that classification performance is maintained, even when packets
are missed at the beginning of a flow and regardless of the direction of the first packet captured.
7.4 Results and analysis
First I look at the effectiveness of classifying data using a sliding window across the test dataset
and an ML classifier trained on full-flow, filtered full-flow and multiple sub-flows.2 Then I
show how Recall and Precision improve when each ML classifier is trained using SSP Option
1 and SSP Option 2 instead. Similar to the work in Chapters 5 and 6 I use a sliding window of
25 packets.
7.4.1 Classifying without training on SSP
Figure 7.8 outlines Recall for the Naive Bayes full-flow model, filtered full-flow model and
multiple sub-flows model as each sliding window moves across the test dataset. M is the number
of packets offset from the beginning of each flow in the test dataset. The graphs cover two
2 The multiple sub-flow model is trained on eight sub-flows found automatically by the EM algorithm described in Chapter 6.
periods: early client contact with the game server (0 ≤ M ≤ 9) and during active game-play
(2000≤M ≤ 2009).
Figure 7.8: Recall (%) for Naive Bayes classifiers trained on full-flow (full-flow model), filtered full-flow (filtered full-flow model) and multiple sub-flows (multiple sub-flows model), as a function of M, the number of packets offset from the beginning of each flow.
Recall for all three classifiers suffers as M increases above zero. Similar to the results seen
in Chapters 5 and 6, full-flow and filtered full-flow models result in very poor Recall. Training
on multiple sub-flows achieves better Recall with a median of greater than 65%. However,
even with the multiple sub-flows model, Recall fluctuates significantly around 66% when 0 ≤
M ≤ 9 and stays relatively stable at ∼70% when 2000≤M ≤ 2009. More importantly, we see
noticeable shifts in Recall each time the sliding classification window moves by one packet.
This is a direct consequence of the classifier assuming (sometimes incorrectly) that the first
packet in the sliding window represents the client-to-server direction, when in reality it does
not (as shown in Figure 7.7).
Figure 7.9 shows Precision for the three Naive Bayes classifiers. While the full-flow model displays the maximum Precision of 100%, this is an indication of an over-fitting problem, as discussed in section 5.4. The high Precision of this classifier means little given its low Recall. Consistent with the results presented in Chapter 5, the filtered full-flow model has a low Precision that fluctuates around ∼40% when 0 ≤ M ≤ 9 and stays lower than 10% when 2000 ≤ M ≤ 2009. The classifier trained on multiple sub-flows achieves
Figure 7.9: Precision for Naive Bayes classifiers trained on full-flow, filtered full-flow and multiple sub-flows
greater than 88% Precision for all M values. A common point for all three classifiers is that their Precision fluctuates noticeably when 0 ≤ M ≤ 9, and less noticeably when 2000 ≤ M ≤ 2009. This is consistent with the fluctuation in the probability of the first packet in the sliding window being in the client-to-server direction, as shown in Figure 7.7.
Figures 7.10 and 7.11 depict Recall and Precision for full-flow, filtered full-flow and multi-
ple sub-flows C4.5 Decision Tree classifiers. Similar to the results seen with the Naive Bayes
classifiers, all three C4.5 Decision Tree classifiers perform better when the classifier correctly
assumes the first packet in the sliding window is in the client-to-server direction. Their Recall
and Precision degrade otherwise.
The C4.5 Decision Tree classifier trained on multiple sub-flows, while achieving the best Recall and Precision among the three, still suffers when the classifier incorrectly assumes the direction of the first packet. Its median Recall is low, fluctuating around 66% when 0 ≤ M ≤ 9 and remaining at ∼76% when 2000 ≤ M ≤ 2009. Its Precision fluctuates above 90% for all M values.
7.4.2 Training on SSP Option 1, classifying with a sliding window
Figure 7.12 compares Recall as a function of M for the Naive Bayes classifier trained with SSP
Option 1 and the multiple sub-flows model.
Figure 7.10: Recall for C4.5 Decision Tree classifiers trained on full-flow, filtered full-flow and multiple sub-flows
Figure 7.11: Precision for C4.5 Decision Tree classifiers trained on full-flow, filtered full-flow and multiple sub-flows
Figure 7.12: Recall for Naive Bayes classifiers trained using SSP Option 1 and multiple sub-flows
A Naive Bayes classifier trained using SSP Option 1 shows a significant improvement in Recall (a median of 98.9%) compared with the multiple sub-flows model (a median of 72.1%). More importantly, Recall is more stable, being less affected by the direction of the traffic flows when the classifier misses the first M packets.
However, there is a trade-off between the gain in Recall and a loss in Precision. Figure 7.13 summarises Precision for the two Naive Bayes classifiers. Compared to the classifier trained on multiple sub-flows, the SSP Option 1 classifier achieves 3% lower Precision on average across all M values, with Precision remaining between 85.2% and 89.8%. It is also notable that median Precision when M ≥ 2000 is lower than when 0 ≤ M ≤ 9. This is due to a smaller number of ET flows when M ≥ 2000, which leads to a smaller number of true positives for ET traffic, as explained in section 5.4.
A C4.5 Decision Tree classifier trained using SSP Option 1 shows a similarly significant improvement in Recall. As presented in Figure 7.14, it displays good Recall (a median of 99.3%) compared to the one trained on multiple sub-flows (a median of 75.2%). Its Recall is not only higher but also more stable, being less affected by the direction of the traffic flows when the classifier misses M packets.
Figure 7.15 presents Precision for the SSP Option 1 and multiple sub-flows C4.5 Decision
Figure 7.13: Precision for Naive Bayes classifiers trained using SSP Option 1 and multiple sub-flows
Figure 7.14: Recall for C4.5 Decision Tree classifiers trained using SSP Option 1 and multiple sub-flows
Tree classifiers. In contrast to the Naive Bayes classifiers, there is also a gain in Precision when training on SSP Option 1. Precision increases by 2.7% on average, staying at 97.3%-98.2% for the SSP Option 1 classifier. This is due to the different responses of the Naive Bayes and C4.5 Decision Tree classifiers when trained with the unwanted range of feature values (as described earlier in Figure 7.6). In the next section I evaluate SSP Option 2, which does not include the unwanted grey area when training the classifiers.
Figure 7.15: Precision for C4.5 Decision Tree classifiers trained using SSP Option 1 and multiple sub-flows
7.4.3 Training on SSP Option 2, classifying with a sliding window
In this section, I compare Precision, Recall, model build time and classification speed³ for Naive Bayes and C4.5 Decision Tree classifiers trained on multiple sub-flows (multiple sub-flows model), SSP Option 1 (SSP Option 1 model) and SSP Option 2 (SSP Option 2 model). Figure 7.16 summarises Recall for the three Naive Bayes classifiers.
As shown in Figure 7.16, a Naive Bayes classifier trained using SSP Option 2 has almost identical Recall to one trained using SSP Option 1. Both models show a great improvement in Recall – higher and more stable – compared to the classifier trained on multiple sub-flows only.
Figure 7.17 summarises Precision for the three Naive Bayes classifiers.
³ Model build time and classification speed are two evaluation metrics defined in Chapter 6.
Figure 7.16: Recall for Naive Bayes classifiers trained using SSP Option 1, SSP Option 2 and multiple sub-flows
Figure 7.17: Precision for Naive Bayes classifiers trained using SSP Option 1, SSP Option 2 and multiple sub-flows
A Naive Bayes classifier trained using SSP Option 2 has increased Precision by approximately 5% for all M values compared to being trained using SSP Option 1. This suggests a positive impact on Precision for the Naive Bayes classifier when using this augmented option. In this case, Precision for the Naive Bayes classifier is even higher and more stable than for a classifier trained on multiple sub-flows only.
Figure 7.18 outlines Recall for the C4.5 Decision Tree classifiers. A C4.5 Decision Tree
classifier trained using SSP Option 2 has a slightly better Recall compared to being trained using
SSP Option 1. This suggests the positive impact on Recall for the C4.5 Decision Tree model
when the unwanted area for each feature is eliminated. Both models show great improvement
in Recall, which is both higher and more stable, compared to the classifier trained on multiple
sub-flows only.
Figure 7.18: Recall for C4.5 Decision Tree classifiers trained using SSP Option 1, SSP Option 2 and multiple sub-flows
Figure 7.19 depicts Precision for the C4.5 Decision Tree classifiers. A C4.5 Decision Tree classifier trained using SSP Option 2 has almost identical Precision to one trained using SSP Option 1. This suggests that the C4.5 Decision Tree algorithm is less affected than the Naive Bayes algorithm by the inclusion of the unwanted area when training the classifier, which is why the further augmentation of Option 2 has little impact on its Precision.
Figure 7.20 shows the normalised model build time and classification speed for the Naive
Figure 7.19: Precision for C4.5 Decision Tree classifiers trained using SSP Option 1, SSP Option 2 and multiple sub-flows
Bayes and C4.5 Decision Tree classifiers. A value of 1 represents the longest model build time of 1,636 seconds, and the highest classification speed of 12,303 instances per second.
Figure 7.20 indicates that the Naive Bayes and C4.5 Decision Tree classifiers trained using
SSP Option 1 take almost double the model build time compared to the same classifiers trained
on multiple sub-flows only. This result is to be expected as the training approach using SSP
Option 1 doubles the number of training instances for the classifiers. Using SSP Option 2
slightly increases model build time for the Naive Bayes classifier, while almost doubling the
model build time for the C4.5 Decision Tree classifier compared to the SSP Option 1 approach. This is a trade-off of training the classifier for four-class rather than two-class classification. In my experiment, the size of the C4.5 Decision Tree increased roughly 20-fold (a tree size of 3,507 versus 175, and 1,754 leaves versus 88).
Both the Naive Bayes and C4.5 Decision Tree classifiers trained using SSP Option 1 are slightly slower compared to the same classifiers trained on multiple sub-flows (by 1% and 7% respectively). Training with SSP Option 2 slows down the Naive Bayes classifier trained on multiple sub-flows by ∼9% and the C4.5 Decision Tree classifier trained on multiple sub-flows by ∼60% (due to the significant increase in tree size mentioned previously).

Figure 7.20: Computational performance for Naive Bayes and C4.5 Decision Tree classifiers trained on multiple sub-flows, SSP Option 1 and SSP Option 2. (a) Normalised model build time; (b) normalised classification speed.
To sum up, training on SSP Option 2 has produced classifiers that are not only accurate but also stable. Recall for both the Naive Bayes and C4.5 Decision Tree classifiers is close to 99%, with Precision close to 98% and 95% for the C4.5 Decision Tree and Naive Bayes classifiers respectively. The classifier is able to maintain its performance regardless of where it begins to capture packets of a given traffic flow in real-time classification.
As a trade-off, this results in a longer time required to build a classification model, and
slower classification speed, especially for the C4.5 Decision Tree classifier. In practice classifi-
cation models can be built offline, hence the longer model build time should not be considered
a significant drawback. The slower classification speed, however, may need to be considered in
deployment when scaling to classification of multiple applications concurrently.
Compared to SSP Option 1, SSP Option 2 improves the Precision of the Naive Bayes classifier significantly, with only a slight trade-off in model build time and classification speed. On the other hand, SSP Option 2 produces only a slight improvement in Recall for the C4.5 Decision Tree classifier, with significant degradation in model build time and classification speed. Considering the gains in Precision and Recall, and the trade-offs in computational performance, SSP Option 1 is considered the simpler and more effective choice at this stage.
7.5 Conclusion
Most research in the literature has assumed that a classifier will see the first packet of each bi-
directional flow, and that this initial packet will be from a client to a server. The classification
model is trained on the basis of this assumption, and subsequent evaluations have presumed the
ML classifier can calculate features with the correct sense of forward and reverse direction. In
real-time IPTC, this assumption can be wrong, especially when the classifier is initiated when
the traffic flows are already in progress.
To solve this problem, I have introduced a novel approach that further complements the
multiple sub-flows training approach presented in the previous chapter. I propose that the ML
classifier should be trained using synthetic sub-flow pairs (SSP). With SSP the statistical fea-
tures of multiple short sub-flows associated with a target application (as discussed in Chapter 5)
are transposed and replicated to construct a synthetic ‘pair’ of features. These pairs of sub-flow
features now reflect the statistical characteristics of a target application’s traffic, whether seen
in the forward or reverse direction.
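The pairing step described above can be sketched in a few lines. The `_fwd`/`_bwd` feature-name convention below is my own illustrative assumption, not the thesis's exact representation:

```python
def make_ssp_pair(features):
    """Given one sub-flow's direction-sensitive features (keys ending in
    '_fwd' or '_bwd'), return the original instance plus a mirrored twin
    in which forward and backward features are transposed. Training on
    both instances makes the learned model direction-neutral."""
    def flip(key):
        if key.endswith("_fwd"):
            return key[:-4] + "_bwd"
        if key.endswith("_bwd"):
            return key[:-4] + "_fwd"
        return key  # direction-neutral features stay put
    mirrored = {flip(k): v for k, v in features.items()}
    return [features, mirrored]

sub_flow = {"mean_len_fwd": 75, "mean_len_bwd": 180,
            "mean_iat_fwd": 0.02, "mean_iat_bwd": 0.05}
pair = make_ssp_pair(sub_flow)
# pair[1]["mean_len_fwd"] == 180: the twin sees the flow 'backwards'
```

Whichever direction the classifier later assumes for the first packet of a window, one member of each synthetic pair matches it.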
My proposal is illustrated by constructing, training and testing with Naive Bayes and C4.5
Decision Tree classifiers for the detection of Wolfenstein Enemy Territory online game traffic.
With this particular scenario and two options for implementing SSP, I demonstrate that SSP
using either option can significantly improve a classifier’s performance when using a small slid-
ing window, regardless of the direction of the first packet of the most recent N packets used for
the classification. Significantly higher and more stable Recall is achieved compared to train-
ing solely on multiple sub-flows with an explicit definition of flow direction. Depending on
the internal construction of each ML algorithm, SSP Option 1 and SSP Option 2 have differ-
ent impacts on the Precision achieved. SSP Option 1 shows an improvement in Precision for
C4.5 Decision Tree classifiers, but degradation in Precision for Naive Bayes classifiers. Op-
tion 2 seems to improve the trade-off in Precision and Recall, especially for the Naive Bayes
classifiers. However, this option results in a longer required model build time and slower clas-
sification speed. With both accuracy (Precision and Recall) and computational performance
(model build time and classification speed) taken into consideration, SSP Option 1 is chosen to
be evaluated in the next chapter, as my overall proposed training method.
Chapter 8
Training Using Synthetic Sub-Flow Pairs with the Assistance of Clustering Techniques (SSP-ACT)
8.1 Introduction
In Chapters 5, 6 and 7, I proposed an approach to building a practical, real-time ML-based IP traffic classifier. This approach uses:

• Training on multiple sub-flows instead of the full-flow, for timely and continuous classification.

• Clustering techniques to automate the selection of sub-flows for training.

• Training on synthetic sub-flow pairs for direction-neutral classification.
I shall refer to this proposed approach as Synthetic Sub-flow Pairs with the Assistance of
Clustering Techniques (SSP-ACT). I have shown that SSP-ACT can significantly improve the
timely and direction-independent real-time classification of Wolfenstein Enemy Territory online
game traffic.
In this chapter, I study the generality and robustness of SSP-ACT. I demonstrate that SSP-ACT also benefits the identification of VoIP traffic. Naive Bayes and C4.5 Decision Tree classifiers trained using SSP-ACT can maintain their performance well under 5% random, independent synthetic packet loss. SSP-ACT can also scale well to simultaneous three-class (ET versus VoIP versus Other) classification.
This chapter is organised as follows. In section 8.2 I demonstrate the robustness of SSP-
ACT when applied to the classification of VoIP traffic. In contrast to ET, VoIP traffic tends to
be more stable over a flow’s lifetime, and more symmetric in the caller-to-callee and callee-to-
caller directions.
Section 8.3 evaluates the performance of the Naive Bayes and C4.5 Decision Tree classi-
fiers trained in Chapter 7 when classifying ET and VoIP traffic in the presence of 5% random,
independent synthetic packet loss.
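The loss model referred to here, each packet dropped independently with probability 0.05, can be sketched as follows. The function name and trace representation are hypothetical:

```python
import random

def drop_packets(packets, loss_rate=0.05, seed=1):
    """Apply random, independent synthetic packet loss: each packet is
    dropped with probability `loss_rate`, independently of the others."""
    rng = random.Random(seed)  # fixed seed for a reproducible experiment
    return [p for p in packets if rng.random() >= loss_rate]

trace = list(range(10_000))
kept = drop_packets(trace)
print(len(kept) / len(trace))  # close to 0.95
```

Because the drops are independent and identically distributed, any 25-packet sliding window is equally likely to be affected, which is what makes this a useful robustness test.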
The possibility of using a single classifier to classify multiple applications simultaneously
is explored in section 8.4. Naive Bayes and C4.5 Decision Tree classifiers are trained using
SSP-ACT to identify ET, VoIP and Other traffic at the same time. A discussion of the pros and
cons of using a common classifier versus multiple classifiers in parallel, each for an individual
application, is presented.
This chapter concludes with a discussion of a number of remaining implementation issues
of SSP-ACT and some options for possible future work.
8.2 Evaluation of SSP-ACT in identifying VoIP traffic
To test the generality of the approach, I identify voice flows encoded with the ITU-T G.711 PCMU¹ [185] (G.711) and GSM 06.10 [186] (GSM) codecs and transported over RTP [101], using Naive Bayes and C4.5 Decision Tree classifiers trained using SSP-ACT. This section starts with a brief background on VoIP and the G.711 and GSM codecs. This is followed by a description of
my data processing method. I then analyse the statistical properties of my VoIP data trace, and
justify why and how SSP-ACT can benefit the identification of VoIP traffic. Finally, Recall and
Precision results are presented and analysed.
8.2.1 A brief background on ITU-T G.711 PCMU and GSM 06.10 encoded voice traffic
VoIP has become a popular Internet application for both home users and enterprises. A voice
session is normally set up and torn down using a signalling protocol, such as Session Initiation
Protocol (SIP) [187]. The signalling packets carry information required to locate users and
allow them to negotiate compatible media types. Setup information can be conveyed between
¹ PCM µ-law
participants using the Session Description Protocol (SDP)² [188].
Voice traffic is carried in Real-Time Transport Protocol (RTP) [101] flows. RTP provides end-to-end delivery services suitable for real-time applications, including interactive audio and video. Different codecs may be used to encode and transmit VoIP traffic. The G.711 PCMU codec samples audio at 8,000 samples/sec with eight bits per sample. The default packetisation interval is 20ms. With this sampling interval, 50 frames/sec are generated, each frame containing 160 bytes of payload. GSM 06.10 (Group Special Mobile) is a European standard for full-rate speech transcoding. It is a frame-based codec, coding at 8,000 samples/sec, with a fixed packetisation interval of 20ms/frame. In the GSM packing used by RTP, every block of 160 audio samples is compressed into a 33-octet frame [189].
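The per-frame numbers quoted above follow from simple arithmetic, sketched here:

```python
# G.711 PCMU: 8,000 samples/sec, 8 bits per sample, 20 ms packetisation.
SAMPLE_RATE = 8000      # samples per second
BITS_PER_SAMPLE = 8
PTIME = 0.020           # packetisation interval, in seconds

samples_per_frame = round(SAMPLE_RATE * PTIME)           # 160 samples
g711_payload = samples_per_frame * BITS_PER_SAMPLE // 8  # 160 bytes/frame
frames_per_sec = round(1 / PTIME)                        # 50 frames/sec

# GSM 06.10: each block of 160 samples is compressed into a 33-octet frame.
gsm_payload = 33
compression_ratio = g711_payload / gsm_payload           # roughly 4.8x

print(samples_per_frame, g711_payload, frames_per_sec, gsm_payload)
```

These payload sizes are what make the two codecs separable by packet-length features later in this section.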
During a voice call, there can be times when both parties are talking, yet a typical voice
conversation has a talk spurt period, followed by a silent period (e.g. one party listening to the
other). When one of the parties remains silent, background noise is picked up and sent over
the network. However, RTP allows discontinuous transmission (silence suppression) when one
party does not speak, to save bandwidth. When silence suppression is enabled, the line may
appear to have dropped at the receiving end. For this reason, comfort noise packets [190] are
generated to compensate for the lack of background noise. At the beginning of an inactive voice
segment (silence period), a comfort noise packet is transmitted in the same RTP voice stream
and indicated by the comfort noise payload type. The comfort noise generator algorithm at the
receiver uses the information in the comfort noise payload to update its noise generation model
and then to produce an appropriate amount of comfort noise. The comfort noise packet sending
rate is implementation specific. It may be sent periodically or only when there is a significant
change in the background noise characteristics [190].
8.2.2 Data collection and research methodology
My experimental data is 3.4 GBytes of VoIP traffic extracted from 50 GBytes of full-payload traffic collected on a home network between 27 November 2006 and 3 August 2007³. The raw data

² SDP is purely a format for session description. It is used in conjunction with different transport protocols as appropriate, including SIP.

³ The network has been set up and maintained by two of my colleagues, L. Stewart and W. Harrop. The data used in my experiments are personal VoIP calls going to and from two VoIP phones, connected to a VoIP provider through an Asterisk server.
trace is a mixture of voice and other Internet applications’ traffic. Voice traffic is extracted (as
described in Appendix D) to provide 644 RTP flows (made up of 594 G.711 flows and 50 GSM
flows) to use as benchmark VoIP flows for subsequent analysis.
These RTP flows contain voice calls with a duration ranging between 19 and 8,207 seconds.
The cumulative distribution of the flows’ duration is presented in Figure 8.1. Median call dura-
tion is 80 seconds (with a total of 7,061 packets in both directions), with the 75th percentile at
355 seconds (with a total of 30,530 packets in both directions).
Figure 8.1: Cumulative distribution of call duration
To provide independent training and testing datasets, I used 341 flows (52% of the available data, consisting of 314 G.711 flows and 27 GSM flows) for clustering and training, and the remaining 303 flows (consisting of 280 G.711 flows and 23 GSM flows) for testing. This split was chosen so that I would have a good number of instances for both training and testing.
Statistical properties of VoIP flows
Figures 8.2 and 8.3 show the mean packet length and mean packet inter-arrival time feature
values from the caller to callee for G.711 traffic for sub-flows with a window size of 25 packets.
M is the number of packets offset from the beginning of each flow. (On the far right of the
x-axis, FF represents features calculated on full-flow.) Statistics of G.711 traffic in the reverse
direction are presented in Appendix D.
Figure 8.2: G.711 traffic - forward direction, mean packet length calculated over a window of 25 packets
Figure 8.3: G.711 traffic - forward direction, mean packet inter-arrival time calculated over a window of 25 packets
As shown in Figure 8.2, most packets are 200 bytes long. For M ≥ 80, there are outliers,
which are less than 200 bytes. These are due to the presence of comfort noise packets (each 41
bytes long) within the sliding window. It makes sense that these outliers are only evident when
the conversations are in progress.
Figure 8.3 indicates that most packets arrive at 20ms intervals. However, there are outliers
that indicate a packet inter-arrival time of more than 20ms. These longer packet inter-arrival
times are due to jitter, packet loss⁴ and silent periods during voice conversations.
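The loss figures reported in the accompanying footnote are derived from gaps in the observed RTP sequence numbers. A sketch of that estimate, handling 16-bit sequence-number wraparound, might look like:

```python
def rtp_loss_fraction(seqs):
    """Estimate the fraction of lost packets in one RTP stream from the
    16-bit sequence numbers actually captured (assumed in sent order)."""
    expected = 0
    for prev, cur in zip(seqs, seqs[1:]):
        expected += (cur - prev) % 65536  # step size, allowing wraparound
    received = len(seqs) - 1
    return (expected - received) / expected if expected else 0.0

# 1000 packets sent, packets 100-104 never captured: 5 of 999 steps lost.
seqs = [s for s in range(1000) if not 100 <= s <= 104]
print(round(rtp_loss_fraction(seqs), 4))  # -> 0.005
```

The same discontinuity count also reveals where in a call the losses occurred, which matters when a 25-packet window happens to straddle a loss burst.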
The analysis of packet length and inter-arrival time statistics for G.711 traffic reveals that, in most cases, voice packets have the same length, with little variation in the packet arrival interval. This makes the voice traffic stable for the duration of a conversation, and symmetric in the forward and reverse directions.
GSM flows have similar traffic characteristics. Figures 8.4 and 8.5 show the mean packet length and mean packet inter-arrival time from the caller to callee for GSM traffic with a window size of 25 packets. Statistics of GSM traffic in the reverse direction are presented in Appendix D.
Figure 8.4: GSM traffic - forward direction, mean packet length calculated over a window of 25 packets
As shown in Figure 8.4, almost all packets are 73 bytes long. There are only a few outliers, due to telephone-event packets. Figure 8.5 shows that most packets arrive at 20ms intervals. However, there are outliers indicating a packet inter-arrival time of greater than 20ms. These
⁴ By observing discontinuities in RTP sequence numbers, 93% of the recorded flows are missing less than 2% of their packets. The largest observed loss of packets (4.9%) involved a single voice conversation where 729 voice RTP packets were missed.
Figure 8.5: GSM traffic - forward direction, mean packet inter-arrival time calculated over a window of 25 packets
longer packet inter-arrival times are due to jitter or packet loss during the voice conversations.⁵

However, there are exceptions (indicated by the outliers in Figures 8.2 and 8.3, for example). When silence suppression is enabled, if the sliding window captures packets that cover a silent period, the occasional presence of comfort noise packets will affect packet length statistics such as the minimum and mean packet length features. Similarly, the presence of silent periods affects the packet inter-arrival time statistics within the window.
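Numerically, the effect on window statistics is easy to see. Assuming 200-byte G.711 voice packets and 41-byte comfort noise packets (as in Figure 8.6):

```python
from statistics import mean

VOICE, COMFORT = 200, 41  # IP packet lengths in bytes (G.711 example)

# A window entirely inside a talk spurt: 25 voice packets.
talk_window = [VOICE] * 25
# A window overlapping a silence period: 20 voice + 5 comfort noise packets.
silence_window = [VOICE] * 20 + [COMFORT] * 5

# Mean packet length drops from 200 to (20*200 + 5*41)/25 = 168.2 bytes,
# and minimum packet length drops from 200 to 41 bytes.
features = {
    "talk": (mean(talk_window), min(talk_window)),
    "silence": (mean(silence_window), min(silence_window)),
}
print(features)
```

A classifier trained only on talk-spurt windows would therefore see quite different feature values whenever the window straddles a silence period.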
Figure 8.6 illustrates the impact of silence suppression by focusing on 15 seconds of traffic in each direction, starting at the 30th second of a 121-second G.711 call. Inspection of the RTP sequence numbers in each direction reveals that no packets were missed. The vertical y-axis presents the IP packet length. The voice packet length is 200 bytes in both directions; however, to make the distinction between the two directions clearer, I plotted the callee-to-caller packets lower. In this example, the traffic is asymmetric in the forward and reverse directions at position 1 and symmetric at position 2.
Figure 8.6: Voice traffic generated during a voice conversation. Comfort noise packets and silence suppression periods during a conversation can create asymmetry and multiple packet sizes within the traffic captured by the sliding window.

My analysis suggests that although voice traffic is usually stable and symmetric in both directions, there are cases where:

• features calculated on full-flow can differ from those calculated on a small sliding window;

• features calculated on a small sliding window at different points during a flow's lifetime can be different from each other; and

• features calculated in one direction can be different from those calculated in the other direction.

⁵ By observing discontinuities in RTP sequence numbers, 88% of the recorded flows are missing less than 2% of their packets. The largest observed loss of packets (3.1%) involved a single voice conversation where 189 voice RTP packets were missed.
In the following section I present an experimental analysis that demonstrates that SSP-ACT
can produce a more accurate classifier compared to training on full-flow. In addition, when
classifying VoIP against other traffic whose statistical properties vary greatly during a flow’s
lifetime and are asymmetric in the forward and reverse directions, SSP-ACT identifies Other
traffic better, and hence can outperform training on full-flow in terms of Precision.
8.2.3 Results and analysis
To maintain consistency with the work in previous chapters, I choose the same sliding window size N of 25 packets. With a 20ms packet interval in one direction, this is equivalent to a maximum of 0.5 seconds to collect 25 packets in a talk spurt⁶. Training and testing datasets are constructed using the same approach described in Chapter 7. Let M be the number of packets offset from the beginning of a flow. Nineteen sub-flows were selected for the clustering process, with 0 ≤ M ≤ 90 and 1,000 ≤ M ≤ 9,000, assuming that this would capture the early and in-progress phases of the voice calls. EM found five clusters among these 19 sub-flows' instances.

⁶ This is the worst case, when there is only voice traffic in one direction; it should be only 0.25 seconds when there is traffic in both directions.
From these clusters, five representative sub-flows, namely SF-0, SF-10, SF-30, SF-1000 and SF-3000, were chosen to train the Naive Bayes and C4.5 Decision Tree classifiers. The Other traffic used for training was the same as that outlined in Chapters 5, 6 and 7.
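The selection step can be sketched as follows. I assume here, as an illustration rather than the thesis's exact rule (which is defined in Chapter 6), that after EM assigns sub-flow instances to clusters, the representative kept for each cluster is the sub-flow offset contributing the most instances to it:

```python
from collections import Counter, defaultdict

def representative_subflows(assignments):
    """Given (subflow_offset, cluster_id) pairs produced by a clustering
    step such as EM, pick one representative sub-flow per cluster: the
    offset contributing the most instances to that cluster."""
    per_cluster = defaultdict(Counter)
    for offset, cluster in assignments:
        per_cluster[cluster][offset] += 1
    return {c: counts.most_common(1)[0][0]
            for c, counts in per_cluster.items()}

# Toy example: instances from sub-flows at offsets 0, 10 and 1000.
assignments = [(0, "A"), (0, "A"), (10, "A"), (1000, "B"), (1000, "B")]
print(representative_subflows(assignments))  # {'A': 0, 'B': 1000}
```

The point of the reduction is that training proceeds on a handful of representative sub-flows rather than on every candidate offset.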
Figure 8.7 shows Recall for VoIP traffic for the Naive Bayes classifiers trained on full-flow and with SSP-ACT. Recall for the former is 100% for all M values. For the latter, 18 of the 19 positions of the sliding window have a Recall of 100%, and the remaining position has a Recall of 99.6% (median Recall of 100%).
Figure 8.7: VoIP Recall: Naive Bayes classifiers trained on full-flow and SSP-ACT
However, training on full-flow shows a very low Precision of less than 6% for all positions of the sliding window, as shown in Figure 8.8. The classifier trained using SSP-ACT, on the other hand, achieves a good median Precision of 95.4% across all positions of the sliding window. Precision is higher for smaller M values, and slightly decreases as M increases. This is because the number of VoIP instances available for testing reduces as M increases (as explained in Section 6.4). The number of false positives is constant (the same instances of Other traffic are used for testing at all M values); Precision therefore depends only on the number of true positives for each M value. When M increases there are fewer flows longer than M+N packets.
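Concretely, with Precision = TP/(TP + FP) and FP held constant, shrinking TP alone pulls Precision down. The counts below are hypothetical, chosen only to show the shape of the effect:

```python
def precision(tp, fp):
    """Precision: the fraction of positive classifications that are correct."""
    return tp / (tp + fp)

FP = 40  # hypothetical fixed count of misclassified Other instances
# As M grows, fewer flows survive to offset M, so true positives shrink:
for tp in (1000, 600, 300):
    print(f"TP={tp}: Precision={precision(tp, FP):.3f}")
# TP=1000 gives ~0.962, TP=600 gives ~0.938, TP=300 gives ~0.882
```

This is why the gentle downward slope in Figure 8.8 reflects the shrinking test set rather than any degradation of the classifier itself.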
Figure 8.8: VoIP Precision: Naive Bayes classifiers trained on full-flow and SSP-ACT
Consequently, there are fewer VoIP flows for testing and the number of true positives for VoIP traffic is reduced. This explains why Precision falls as M increases.
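The dependence of Precision on the number of true positives alone can be seen with a small worked example. The counts below are hypothetical, chosen only to illustrate the arithmetic: the Other test set is fixed, so the false positive count is the same at every window position M, while the true positive count shrinks as fewer VoIP flows survive past larger M.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP)."""
    return tp / (tp + fp)

# Constant false positives (fixed Other test set), shrinking true positives
fp = 40
for tp in (1000, 400, 100):   # fewer VoIP flows survive past larger M
    print(f"TP={tp:4d}  Precision={precision(tp, fp):.3f}")
```

Recall is unaffected throughout, since no VoIP instance that is tested is misclassified; only the pool of testable VoIP instances shrinks.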
Similar results are found with the C4.5 Decision Tree classifiers. As shown in Figure 8.9, training on full-flow exhibits a median Recall of 99.6%, while training using SSP-ACT shows a median Recall of 95.7% for all M values. Although training on full-flow produces higher Recall than training using SSP-ACT, the higher Recall is less meaningful once we look at the Precision.
Figure 8.9: VoIP Recall: C4.5 Decision Tree classifiers trained on full-flow and SSP-ACT
Figure 8.10 summarises Precision for the C4.5 Decision Tree classifiers trained on full-flow and SSP-ACT. The classifier trained on full-flow demonstrates very low Precision, of less than 5.6%, for all positions of the sliding window. In contrast, the classifier trained using SSP-ACT achieves a significantly higher median Precision of 99.2% over all positions of the sliding window.
Figure 8.10: VoIP Precision: C4.5 Decision Tree classifiers trained on full-flow and SSP-ACT
The high Recall and low Precision when using classifiers trained on full-flow can be explained as follows. As most VoIP flows are stable and symmetric in the forward and reverse directions, training on the full-flow is sufficient to identify VoIP traffic even when classifying on a small window; hence the high Recall for VoIP traffic.
However, the statistical characteristics of Other traffic vary during a flow's lifetime and are asymmetric in the forward and reverse directions (see Appendix A). When classifying using a small sliding window, training on the full-flow fails to distinguish the Other traffic. Both the Naive Bayes and C4.5 Decision Tree classifiers trained on full-flow misclassify a substantial amount of Other traffic as VoIP traffic, hence their very poor Precision. In contrast, classifiers trained using SSP-ACT can identify both VoIP and Other traffic well when classifying on a small window, hence their good results in both Recall and Precision.
Figure 8.9 reveals that training using SSP-ACT can sometimes produce slightly lower Recall than training on full-flow. This is because features calculated on a sub-flow can differ from those calculated on the full-flow, for both VoIP and Other traffic. The distinction between VoIP and Other traffic features calculated on sub-flows may not be as pronounced as when calculated on full-flows. Training on sub-flows hence generates more conservative classification rules than training on full-flows. While these rules produce high Precision for VoIP traffic, they come with a trade-off of more false negatives, resulting in lower Recall compared to training on full-flow.
This explanation is illustrated in Figure 8.11. In Figure 8.11(a), the classification model trained on full-flow covers a greater range of VoIP instances. When training using SSP-ACT, because sub-flows' feature values differ from those of full-flows, the resulting classification model covers a smaller range of VoIP sub-flow instances (as shown in Figure 8.11(b)). For VoIP traffic, the differences between feature values calculated on sub-flows and full-flows are less significant than for Other traffic. Hence, the full-flow model can produce high Recall but very low Precision when classifying using a small sliding window. Training using SSP-ACT provides much greater Precision, with a small trade-off in Recall due to a greater number of false negatives.
There are also trade-offs between Recall and Precision due to the particular internal construction of each ML algorithm. When both are trained using SSP-ACT, the Naive Bayes classifier has a slightly higher Recall and slightly lower Precision than the C4.5 Decision Tree classifier. Deeper exploration of this area is a subject for future research.
To sum up, SSP-ACT has been demonstrated to benefit VoIP traffic, whose statistical characteristics differ significantly from those of ET. Training using SSP-ACT produces a more accurate classifier, in terms of both Recall and Precision, when identifying VoIP traffic among interference traffic whose statistical characteristics vary over the flows' lifetimes and are asymmetric in the forward and reverse directions.
8.3 Evaluation of SSP-ACT in the presence of additional packet loss
In this section I present my preliminary results from an investigation of the robustness of SSP-
ACT. Like any other IP traffic classification approach which relies on the statistical properties
of traffic at the network layer, the performance of a classifier trained using SSP-ACT could be
affected when deployed in a variety of network environments. Network layer perturbations such
as packet loss, packet re-ordering, traffic shaping, packet fragmentation, and jitter are likely to
result in variations of the feature values.
(a) Classification model for VoIP traffic, trained on full-flow
(b) Classification model for VoIP traffic, trained using SSP-ACT
(c) Classifying VoIP using classifiers trained on full-flow and SSP-ACT

Figure 8.11: VoIP classification using classifiers trained on full-flow and SSP-ACT: Training on full-flow may cover a larger area of VoIP instances when classifying using a small sliding window, hence resulting in higher Recall but lower Precision compared to training using SSP-ACT. (The data points are artificially created for illustration purposes only. They are not actual data points from my dataset.)

My preliminary investigation is on the impact of packet loss, as packet loss will have an impact on the values of features that are based on packet length and packet arrival time statistics.
For example, a loss of a few packets in a sliding window would result in longer gaps in packet
arrivals, which consequently affects the packet inter-arrival time statistics. Similarly, the loss
of large or small packets would affect the packet length statistics, and so on. An example of
packet loss on VoIP traffic is illustrated in Figure 8.12. A loss of packet 2 increases the packet
inter-arrival time from T to 2T.
Figure 8.12: A simple illustration of the impact of packet loss on packet inter-arrival time statistics
These changes are likely to have an impact on the classifier’s performance, in terms of both
Precision and Recall. Packet loss on the target application can result in a greater number of
false negatives, and hence can lower the classifier’s Recall. Packet loss on Other applications
can result in a greater number of false positives, which can thus lower the Precision for the
classification of the application of interest.
Different packet loss patterns (such as random or bursty losses [191]) can also have different impacts on feature values compared to the no-loss case. In this initial evaluation, I focus on the impact of random, independent packet loss. Future work can expand on this through an evaluation of other loss patterns and loss processes (for example, the Gilbert model as described in [192]).
In this section, I study the changes in Precision and Recall of a classifier trained using SSP-ACT when classifying a testing dataset perturbed by the inclusion of synthetic packet loss. From the original test trace file, I create synthetic loss by randomly skipping packets when calculating feature statistics (I assume here that any given packet may be lost with a pre-specified probability p, and that these random losses are independent).
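The synthetic loss process just described can be sketched as follows. This is an illustrative reconstruction, not the actual tool used in this thesis; the function name and the fixed seed are my own assumptions.

```python
import random

def apply_random_loss(packets, p, seed=1):
    """Simulate random, independent packet loss: each packet is dropped
    with probability p before feature statistics are computed."""
    rng = random.Random(seed)
    return [pkt for pkt in packets if rng.random() >= p]

# Timestamps of a steady stream with inter-arrival time T = 20 ms
T = 0.02
times = [i * T for i in range(1000)]
survivors = apply_random_loss(times, p=0.05)
iats = [b - a for a, b in zip(survivors, survivors[1:])]
print(len(times) - len(survivors), "packets lost")
# A lost packet turns a gap of T into 2T (or more, for back-to-back losses)
print("max inter-arrival:", round(max(iats), 3))
```

Recomputing the window features on the surviving packets then gives the loss-perturbed test instances.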
I consider the classification of both ET and VoIP traffic. These applications are not reactive to packet loss (no significant flow control or packet retransmissions occur at the application or transport layers). On the other hand, many Other applications are TCP-based; consequently, they exhibit network-layer retransmissions and adjust their sending rate in the presence of packet loss. The loss-reactive nature of TCP flows makes it hard to predict the impact of additional (synthetic) loss on previously calculated feature values. Therefore, in this preliminary study, the synthetic loss is applied only to the ET and VoIP traffic in the test dataset, not to the Other traffic. In so doing, my study focuses on evaluating the impact of packet loss on Recall for both ET and VoIP traffic. The experimental Precision results may be higher than the actual Precision achieved when packet loss also occurs with Other traffic. This work can be expanded in future with the application of more comprehensive packet loss models to both the application of interest and the Other interference traffic.
In terms of selecting a realistic loss rate to test, the review in [193] shows that, due to the enormous diversity of the Internet, few studies agree on the average packet loss rate and the average loss burst length (i.e. the number of packets lost in a row). Among the works reviewed in [193], covering the period 1998-1999, the average Internet packet loss was reported to vary between 0.36% and 11%, depending on the particular study ([194], [195], [196] and [197]). A 2007 study characterising residential broadband networks [198] reveals that both cable and DSL have remarkably low packet loss rates, of less than 1% for more than 95% of all broadband paths. The data was drawn from 1,894 broadband hosts from 11 major cable and DSL providers in North America and Europe. The studies of [199] and [200] indicate that a small packet loss of 1% or 2% would lead to dramatic degradation in TCP throughput. These findings suggest that an ISP should not tolerate greater loss on its access links, as it would likely trigger complaints from consumers. The analysis of my VoIP dataset also shows a very low packet loss rate.
My assumption is that the greater the packet loss, the greater the impact on feature values, and hence the more significant the possible impact on the classifier's performance. Therefore, I chose to study the impact of a 5% packet loss (total loss in both directions for bi-directional traffic), to approximate a reasonable upper bound on tolerable packet loss on an Internet access link. I apply a 5% packet loss to the ET dataset used in Chapters 5 and 7 and to the VoIP dataset used earlier in section 8.2. The impact of other loss rates remains for future research.
For a small sliding window of 25 packets, a 5% packet loss typically results in the loss of only 1-2 packets. This small loss therefore has little impact on ET and VoIP feature values. With ET traffic, the median of the features calculated with packet loss is only slightly greater than for those calculated on the original test dataset. For example, the median of the maximum packet inter-arrival time feature over a 25-packet window, for all sub-flow instances, is increased by only ∼2 ms (0.2%) compared to the original test dataset. Analysis of other features shows similar results.
With VoIP traffic, the median of the maximum packet inter-arrival time feature for all sub-flow instances increased by ∼18.7 ms (88.3%) 7 compared to the original test dataset. However, a 5% packet loss changes the mean packet inter-arrival time and minimum packet inter-arrival time features only very slightly. On average, with packet loss, the median of the minimum packet inter-arrival time feature for all sub-flow instances increased by only ∼0.8 ms (4%) compared to the original test dataset. A 5% packet loss also has little impact on the maximum, minimum and mean packet length features. This is because VoIP traffic is quite stable in terms of packet length and arrival statistics, so a small loss of 1-2 packets out of a 25-packet window has only a minor impact on these statistics.
The following sections report on the effects of these changes on the classification of ET and
VoIP traffic.
8.3.1 Impact of packet loss on the classification of ET traffic
Figure 8.13 shows Recall for the Naive Bayes classifier both with and without a 5% synthetic loss applied to the test dataset. The classification model is trained with SSP-ACT as described in Chapter 7. Results are recorded for each position of the sliding window, expressed as the number of packets (M) offset from the beginning of each flow.
As shown in Figures 8.13 and 8.14, a 5% synthetic loss applied to ET traffic caused the
median Recall for all M values to reduce by 0.45%, and the median Precision for all M values
to reduce by 0.4%.
With these slight degradations, the Naive Bayes classifier exhibits good median Recall and Precision of 98% and 86% respectively. This suggests that a Naive Bayes classifier trained using SSP-ACT could maintain its performance well with a loss rate of 5% applied to the test dataset.

7 This is not increased by 100% as in the example given in Figure 8.12, as a bigger gap caused by packet loss may still be less than the packet gap caused by silent periods within the sliding window.

Figure 8.13: ET Recall: Training with SSP-ACT and classifying with ET traffic experiencing 5% random packet loss - Naive Bayes classifier

Figure 8.14: ET Precision: Training with SSP-ACT and classifying with ET traffic experiencing 5% random packet loss - Naive Bayes classifier
Figures 8.15 and 8.16 show Recall and Precision for the C4.5 Decision Tree classifier both
with and without a 5% synthetic loss applied to the test dataset. On average, a 5% synthetic loss
applied to ET traffic caused the median Recall for all M values to reduce by only 0.5%, and the
median Precision for all M values to reduce by only 0.25%.
Figure 8.15: ET Recall: Training with SSP-ACT and classifying with ET traffic experiencing 5% random packet loss - C4.5 Decision Tree classifier
To summarise, despite the slight degradation due to a 5% packet loss, the C4.5 Decision Tree
classifier still demonstrates a good median Recall and Precision of 97% and 96.7% respectively.
This suggests the C4.5 Decision Tree classifier trained using SSP-ACT can also maintain its
performance well with a loss rate of 5%.
8.3.2 Impact of packet loss on the classification of VoIP traffic
Figure 8.17 depicts Recall for a Naive Bayes classifier trained using SSP-ACT and tested both with and without a 5% synthetic loss for VoIP traffic. The results are recorded for each position of the sliding window, expressed as the number of packets (M) offset from the beginning of each flow. A 5% synthetic loss applied to VoIP traffic does not have a noticeable impact on Recall for the Naive Bayes classifier. Recall remains the same, at more than 99.6%, for all M values.
Figure 8.16: ET Precision: Training with SSP-ACT and classifying with ET traffic experiencing 5% random packet loss - C4.5 Decision Tree classifier
Figure 8.17: VoIP Recall: Training with SSP-ACT and classifying with VoIP traffic experiencing 5% random packet loss - Naive Bayes classifier
Not surprisingly, there was no noticeable impact on Precision, as shown in Figure 8.18. This
is because Recall is maintained, and the same test dataset is used for Other traffic in both tests.
Figure 8.18: VoIP Precision: Training with SSP-ACT and classifying with VoIP traffic experiencing 5% random packet loss - Naive Bayes classifier
With a 5% packet loss, the Naive Bayes classifier trained using SSP-ACT maintains its performance in terms of both Precision and Recall for VoIP traffic. This is because, while packet loss does affect packet inter-arrival time statistics, the longer packet inter-arrival time caused by packet loss may simply look similar to a long packet gap due to silent periods during a voice conversation or jitter on the network. Flows used in training contain silence suppression periods, jitter, and even packet loss 8. For this reason, the classifier has a good chance of maintaining its performance even with a 5% packet loss.
However, this also depends on the internal construction of the particular ML algorithm. In contrast to the Naive Bayes classifier, the C4.5 Decision Tree classifier shows a significant negative impact of packet loss on Recall.
Figures 8.19 and 8.20 show Recall and Precision respectively for the C4.5 Decision Tree classifier.
8 Filtering out these flows in training would make the test results clearer; however, it would also reduce the number of instances for training significantly. Repeating the experiments with absolutely no packet loss in the training dataset is left for future work.

Figure 8.19: VoIP Recall: Training with SSP-ACT and classifying with VoIP traffic experiencing 5% random packet loss - C4.5 Decision Tree classifier

Figure 8.20: VoIP Precision: Training with SSP-ACT and classifying with VoIP traffic experiencing 5% random packet loss - C4.5 Decision Tree classifier

As shown in Figure 8.19, a 5% packet loss applied to VoIP traffic reduced median Recall by up to 8.5% for all M values. Median Precision was only slightly reduced, by 0.1%, due to
the variation in Recall (a reduction in the number of true positives) for VoIP traffic. Precision falls only slightly while Recall falls significantly because the number of false positives is very small relative to the number of false negatives. Despite these degradations, median Recall and Precision for the C4.5 Decision Tree classifier still remained above 87% and 99% respectively.
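The arithmetic behind this asymmetry can be illustrated with hypothetical counts (not taken from my dataset): when false positives are rare, even a sizeable loss of true positives moves Precision only marginally.

```python
# Hypothetical counts chosen only to illustrate the arithmetic.
tp, fp = 10_000, 80            # very few false positives (Precision ~ 99.2%)
print(f"before: precision={tp / (tp + fp):.4f}")

tp_lost = int(tp * 0.915)      # an 8.5% drop in Recall removes true positives
print(f"after:  precision={tp_lost / (tp_lost + fp):.4f}")
```

Since fp is small, both the numerator and the dominant part of the denominator shrink together, leaving the ratio nearly unchanged.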
The C4.5 Decision Tree classifier is more sensitive to packet loss than the Naive Bayes
classifier because of differences between the internal mechanisms of each ML algorithm. The
former builds a tree based on precise differences in feature values, while the latter builds a
model based on approximations. For the C4.5 Decision Tree, a small change in feature values
can lead to different sub-tree paths at test nodes within the tree, which can subsequently lead to
a different classification result [119] (as noted in the discussion in section 3.1.3 on the instability
of decision tree algorithms). The Naive Bayes classifier, on the other hand, is more tolerant of small variations in feature values. Deeper investigation into the sensitivity of the Naive Bayes and C4.5 Decision Tree classifiers to packet loss is left for future research.
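This difference in sensitivity can be illustrated with a toy one-feature model. The threshold, means, and standard deviations below are invented for illustration only and are unrelated to the classifiers actually trained in this chapter.

```python
import math

# Hypothetical per-class (mean, std dev) for a single feature, e.g. the
# maximum packet inter-arrival time in ms. Equal class priors assumed.
PARAMS = {"VoIP": (8.0, 3.0), "Other": (20.0, 5.0)}

def tree_predict(x, threshold=10.0):
    """One test node of a decision tree: a hard threshold on the feature."""
    return "VoIP" if x < threshold else "Other"

def gaussian_pdf(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def nb_predict(x):
    """One-feature Gaussian Naive Bayes: pick the class with higher likelihood."""
    return max(PARAMS, key=lambda c: gaussian_pdf(x, *PARAMS[c]))

# A small loss-induced shift in the feature (9.9 -> 10.2) crosses the tree's
# split threshold and flips its decision, while the Naive Bayes likelihoods
# shift only slightly and the predicted class is unchanged:
for x in (9.9, 10.2):
    print(x, tree_predict(x), nb_predict(x))
```

The brittleness comes from the hard threshold at a test node, not from the tree's overall accuracy; the Gaussian model degrades gracefully because its likelihoods are continuous in the feature value.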
My experimental results suggest that the SSP-ACT training approach is robust when classifying traffic experiencing a 5% packet loss. This work could be extended with the use of different loss models and packet loss rates. However, since I propose classifying on a small sliding window, a classifier seeing a burst of packet loss may perform as if classifying traffic whose initial packets were missed. Training using SSP-ACT can help the classifier maintain performance in this case. Further exploration of this issue, and evaluation of SSP-ACT with changes in other network environment parameters, is left for future work.
8.4 Concurrent classification of multiple applications with SSP-ACT
So far, SSP-ACT has only been applied to the identification of a single application of interest. For further development, it should be evaluated for the identification of multiple applications simultaneously.
This evaluation idea can be illustrated using the simple scenario presented in Figure 3.4. A home network generates Game (e.g. ET), VoIP and traffic from other common Internet protocols (such as P2P, Web, SSH and SMTP). Previously I considered only VoIP or only Game traffic as the priority traffic.
However, there may be cases where both VoIP and ET are specified as priority traffic. To improve the flexibility of users' options, VoIP and ET may also require different priority levels, in which case it might be desirable for them to be identified as two separate classes. This leads to the requirement for three-class classification: VoIP versus ET versus Other.
For the simultaneous identification of VoIP and ET traffic, there are a number of possible
solutions. In this section, I consider two possible options. The first option is to use a common
classifier to classify both types of traffic (I shall refer to this option as ‘Option A - Common
classifier’). The second option is to use two separate classifiers in parallel; each classifies one
type of traffic (I shall refer to this option as ‘Option B - Separate classifiers in parallel’). Figure
8.21 and Figure 8.22 illustrate Option A and Option B respectively.
In Option A, VoIP, ET and other traffic (such as Web, P2P, SSH and SMTP) are labelled
separately as VoIP, ET and Other classes to train the classifier. Only one classification model is
used to classify a new traffic flow as either VoIP or ET or Other traffic.
In Option B, two classifiers are required to classify a new flow as {ET or Other} and {VoIP or Other} traffic. The difference from Option A is that the task for each classifier is simpler: each classifier identifies only two classes, rather than three classes simultaneously. Option A thus may require a more powerful, centralised processing unit than Option B. A logical AND operation is needed to avoid the classification results of classifier 1 and classifier 2 overriding each other when identifying ET and VoIP traffic. For example, for an ET flow, classifier 2 would indicate 'ET' and classifier 1 would indicate 'Other'; the correct output for the overall system would be 'ET'.
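The combining logic of Option B can be sketched as follows. The function and label names are illustrative; the conflicting case, where both classifiers claim the same flow, is discussed at the end of this section.

```python
def combine(voip_result, et_result):
    """Combine the outputs of classifier 1 ({VoIP, Other}) and classifier 2
    ({ET, Other}): a flow is labelled Other only when BOTH classifiers say
    Other; otherwise the non-Other label wins."""
    if voip_result != "Other":
        return voip_result
    return et_result

print(combine("Other", "ET"))     # an ET flow: classifier 1 says Other
print(combine("VoIP", "Other"))   # a VoIP flow: classifier 2 says Other
print(combine("Other", "Other"))  # neither priority class matched
```

This is the AND operation shown in Figure 8.22: neither classifier's Other verdict is allowed to override the other's positive identification.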
In order to decide which option to use, we need to compare and evaluate the performance of classifiers trained using the two options, based on the operational challenges addressed in section 3.3.2. My work as outlined in section 7.4.3 offers insights into the pros and cons of four-class classification versus two-class classification. In the case of section 7.4.3, four-class classification results in better trade-offs between Precision and Recall 9. However, it comes with a longer model training time and a slower classification speed. This finding is consistent with my results reported below.

9 The level of improvement is dependent on the particularities of the ML algorithm.
Figure 8.21: Training for VoIP and ET traffic identification: Option A - Common classifier
Figure 8.22: Training for VoIP and ET traffic identification: Option B - Separate classifiers in parallel
(a) Recall - Naive Bayes classifier
(b) Precision - Naive Bayes classifier
Figure 8.23: VoIP Recall and Precision: Naive Bayes classifier
(a) Recall - C4.5 Decision Tree classifier
(b) Precision - C4.5 Decision Tree classifier
Figure 8.24: VoIP Recall and Precision: C4.5 Decision Tree classifier
Figures 8.23 and 8.24 present Recall and Precision respectively for VoIP traffic identified by the Naive Bayes and C4.5 Decision Tree classifiers trained using SSP-ACT. As seen in the figures, Option A results in better trade-offs between Precision and Recall for both the Naive Bayes and C4.5 Decision Tree classifiers.
In the case of the Naive Bayes classifier, Option A has similar Recall to Option B but slightly better Precision, due to the reduction in the number of false positives. For the C4.5 Decision Tree classifier, Option A's Precision is slightly lower than Option B's, yet its Recall is significantly higher. This result is consistent with the findings in section 7.4.3 and occurs for the same reasons. In Option B, ET traffic samples and other traffic samples are merged into a single Other class to train a classifier. This may create an unwanted area of feature values in training the classifier. In Option A, this unwanted area is removed, which helps improve Precision for the Naive Bayes classifier, and improves Recall for the C4.5 Decision Tree classifier with a slight trade-off in Precision.
(a) Normalised Model Build Time
(b) Normalised Classification Speed
Figure 8.25: Computational performance for Naive Bayes and C4.5 Decision Tree classifiers trained with Option A and Option B
Figure 8.25 presents the normalised model build time and classification speed for the Naive Bayes and C4.5 Decision Tree classifiers with Option A and Option B. The value of 1 represents the longest time taken to build a classification model (789 seconds), and the highest classification speed (19,073 instances per second).
In terms of required model build time, training a Naive Bayes classifier using Option A takes approximately 10% longer than using Option B. For the C4.5 Decision Tree classifier, training using Option A takes approximately 60% longer than training a single Option B classifier to identify VoIP traffic. Yet Option B requires training two classifiers, for both ET and VoIP traffic, which may double the model build time of a single classifier.
In terms of classification speed, using Option A with the Naive Bayes classifier slows down
the classifier slightly (∼ 5%). With the C4.5 Decision Tree classifier, using Option A slows
down the classification speed significantly, by ∼ 60%. The slower classification speed when
using Option A could become an issue when scaling to classify a large number of applications
simultaneously.
Besides accuracy and computational performance, the differences in statistical traffic characteristics and QoS requirements of the applications of interest also need to be considered. Each application may have a particular optimal classification window size, which would influence trade-offs between classification timeliness, Recall and Precision, and the computational performance of the classification. For example, my studies indicate that ET and VoIP require only 25 packets for high-Precision and high-Recall classification. However, another application, such as video conferencing, might require a different number of packets to obtain an acceptably accurate and timely classification. A common classifier in Option A must balance the trade-offs between different performance parameters for all applications, while the individual classifiers in Option B have better opportunities to optimise the parameters for each individual application.
Another drawback of Option B is that there may be conflicts between classification results. For example, a flow might be classified as VoIP by classifier 1 and as ET by classifier 2. The solution to such a situation is implementation specific. For example, a flow that produces conflicting classification results could be assigned to the class with the lower priority of the two.
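A minimal sketch of this priority-based resolution follows; the numeric priority table is my own illustrative construct, not part of the thesis's implementation.

```python
# Hypothetical priority table: a smaller number means a higher priority.
PRIORITY = {"VoIP": 1, "ET": 2}

def resolve_conflict(result_1, result_2):
    """When classifier 1 and classifier 2 each claim the same flow (e.g.
    VoIP and ET), assign it to the class with the LOWER priority of the
    two: a conservative choice that avoids wrongly promoting traffic to
    a higher-priority class."""
    return max((result_1, result_2), key=lambda c: PRIORITY[c])

print(resolve_conflict("VoIP", "ET"))
```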
Table 8.1 summarises the pros and cons of both options.
My preliminary results and analysis suggest that a common classifier can be used to classify multiple applications. (In other words, it is not necessary to have a separate classifier for each application.) However, this approach has pros and cons compared with using a separate classifier for each individual application. Taking these into account, Option A may still be the simpler and more effective choice if its classification speed is acceptable for a particular purpose. Furthermore, as suggested by [158], one- or two-class classification could be sufficient for QoS-enabled Internet access networks.
Table 8.1: Comparison of the pros and cons of Option A (common classifier) versus Option B (separate classifiers in parallel)

Option A: Common classifier
• Provides better balance in trade-offs between Precision and Recall
• Slower classification speed, which can be an issue when scaling to a great number of applications
• A single classifier requires updating when the classification model needs to be updated
• Model building and classification tasks are concentrated in a single processing component, which may require a powerful and expensive processing unit when scaling to a great number of applications
• Must use the same sliding window for all applications, making it harder to optimise performance for individual applications
• No conflicts in classification results

Option B: Separate classifiers in parallel
• Provides worse balance in trade-offs between Precision and Recall
• Faster classification speed
• All individual classifiers require updating when the classification models need to be updated
• Training and classification workload can be divided among multiple processing components; cheaper components can be used in parallel
• Flexible in optimising for individual applications
• Possible conflicts in classification results
8.5 Discussion
Since this chapter has outlined my preliminary study of the generality and robustness of SSP-
ACT, there are a number of limitations in my current experimental approach. Further improve-
ment can be made in the following areas:
• In evaluating SSP-ACT’s robustness, only a single 5% random, independent loss rate has
been studied. Future work could expand on this through an evaluation covering a wide range
of loss rates and other loss patterns (such as the Gilbert model described in [192]).
Furthermore, in my experiments packet loss was applied only to the traffic of interest. This
work could be extended to study the performance of SSP-ACT when packet loss affects other
traffic as well.
• SSP-ACT has been tested with only a limited number of sliding window positions. Its
stability could be evaluated further through more exhaustive testing of window positions
throughout the flow’s lifetime.
• In evaluating SSP-ACT’s scalability, I have studied its performance in classifying up
to three classes (VoIP versus ET versus Other) simultaneously. How SSP-ACT would scale to
classify a large number (e.g. hundreds or thousands) of applications simultaneously is a
question that requires further study. Although it is not clear that any business scenario
requires the identification of hundreds of applications or QoS classes simultaneously, it
would be of interest for future work to evaluate the trade-offs between accuracy and
computational performance of SSP-ACT at that scale. Even in the scenario of two QoS classes,
how many applications can be grouped into a single class before classification accuracy and
computational performance degrade is a valuable subject for further research.
• In my experiments a sliding window of 25 packets worked well in terms of classification
timeliness, Precision and Recall for both ET and VoIP traffic. However, another application
may have different requirements, calling for a different optimal classification window size.
An optimal window size should balance the trade-offs between the classifier’s Precision and
Recall, classification timeliness, classification speed and processing overhead.
Nevertheless, using a common classifier to classify multiple applications simultaneously
requires the same sliding window for all applications. Detailed characterisation of this
trade-off is a subject for future research.
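To make the loss-pattern discussion above concrete, the sketch below generates per-packet loss indicators from a two-state Gilbert model. The parameterisation is a common textbook form and not necessarily the exact variant of [192]; with equal per-state drop probabilities it degenerates to the random, independent loss used in this chapter.

```python
import random

def gilbert_loss_pattern(n, p_gb, p_bg, loss_good=0.0, loss_bad=1.0, seed=1):
    """Per-packet loss indicators (True = lost) from a two-state Gilbert model.

    The chain starts in the 'good' state. p_gb and p_bg are the
    good->bad and bad->good transition probabilities; each state drops
    packets with its own probability. With loss_good == loss_bad == q the
    pattern reduces to random, independent loss at rate q.
    """
    rng = random.Random(seed)
    bad = False
    lost = []
    for _ in range(n):
        lost.append(rng.random() < (loss_bad if bad else loss_good))
        # State transition before the next packet.
        if bad:
            bad = not (rng.random() < p_bg)
        else:
            bad = rng.random() < p_gb
    return lost
```

Unlike independent loss, the Gilbert model produces bursts of consecutive losses (the long-run loss rate here is governed by the stationary bad-state probability p_gb / (p_gb + p_bg)), which is precisely the kind of pattern future robustness testing of SSP-ACT could cover.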
8.6 Conclusion
In this chapter, I have demonstrated the effectiveness of SSP-ACT in identifying VoIP traffic.
Training using SSP-ACT produces a classifier that is accurate in terms of both Precision and
Recall, even when the VoIP traffic to be identified has statistical properties that vary over
the flows’ lifetimes and are asymmetric in the forward and reverse directions.
I have evaluated the robustness of SSP-ACT when classifying with synthetic packet loss.
Both the Naive Bayes and C4.5 Decision Tree classifiers maintain their performance well with
a 5% synthetic packet loss applied to the test dataset. Evaluating SSP-ACT with a larger range
of applications and other loss, delay and jitter models remains for future work.
I also considered the use of SSP-ACT for classifying multiple applications simultaneously. My
preliminary results suggest that it is possible to use a common classifier for the
classification of multiple applications. However, this approach entails pros and cons versus
using a separate classifier for each individual application. In my experiments, the common
classifier provided a better balance in the trade-offs between Precision and Recall, and is
easier to update. However, it is slower in classification speed, which can be an issue when
scaling to a greater number of applications, and it uses the same sliding window size for all
applications, which makes it harder to optimise performance for individual applications. The
requirement for a powerful central processing unit may also be an issue at larger scales.
Deeper study of this subject would be valuable future research.
Chapter 9
Conclusion
Real-time traffic classification has the potential to solve difficult network management prob-
lems, in the areas of QoS provisioning, Internet accounting and charging, and lawful intercep-
tion. The traditional IP traffic classification (IPTC) techniques that rely mostly on destination
port numbers or ‘deep packet inspection’ are becoming less effective. This has motivated the
development of new approaches that classify traffic by learning and recognising statistical
patterns in the externally observable attributes of the traffic[1].
My literature review on the use of Machine Learning (ML) in IPTC suggests that ML-based
IPTC has great potential as a new and robust approach. Previously proposed work on ML-based
IPTC has shown very promising results. However, it has not considered the practical constraints
of deployment in real-life operational networks. My thesis, therefore, has filled this research
gap and shown that statistically based IPTC using ML is a feasible and practical approach.
In this thesis I have proposed and demonstrated a novel solution in which an ML classifier is
trained using a set of short sub-flows extracted from full-flows generated by the target
application(s), coupled with their mirror-image replicas. My proposal (referred to as SSP-ACT)
was illustrated by considering an ISP that wishes to automatically and quickly detect online
interactive game traffic (ET) or voice (VoIP) traffic mingled in among regular consumer IP
traffic. The results presented in Chapters 5 to 8 revealed that, using a sliding window of 25
packets, the Naive Bayes classifier achieved 98.9% median Recall and 87% median Precision
when classifying ET traffic, and 100% median Recall and 95.4% median Precision when clas-
sifying VoIP traffic. The C4.5 Decision Tree classifier achieved 99.3% median Recall and 97%
[1] Elaboration on these issues was presented in Chapter 2.
median Precision when classifying ET traffic, and 95.7% median Recall and 99.2% Precision
when classifying VoIP traffic[2]. Both classifiers maintained their performance well
regardless of how many packets were missed from the beginning of each flow, and regardless of
the direction of the first packet among the most recent N packets used for the classification.
Compared with the poor performance of classifiers trained on full-flows – a common method in
the literature – these results indicate that SSP-ACT is a significant improvement over the
previously published state of the art in IP traffic classification. Although the experiments
were confined to online game and VoIP applications, my results suggest a potential solution
for the accurate and timely classification of traffic belonging to other Internet applications.
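The core of the proposal can be sketched as follows (Python; the packet representation and field names are illustrative, not the thesis's actual data format). Short N-packet sub-flows are cut from a full-flow at chosen packet offsets, and each is paired with a mirror-image replica in which the forward and reverse directions are swapped:

```python
def subflows_with_mirrors(flow, offsets, n=25):
    """Extract n-packet sub-flows at given packet offsets, plus mirrors.

    `flow` is a list of (direction, length) tuples with direction 'fwd'
    or 'rev'. The mirror-image replica of each sub-flow swaps the two
    directions.
    """
    out = []
    for off in offsets:
        sub = flow[off:off + n]
        if len(sub) < n:
            continue  # not enough packets left for a full sub-flow
        mirror = [("rev" if d == "fwd" else "fwd", size) for d, size in sub]
        out.append(sub)
        out.append(mirror)
    return out
```

Training on both a sub-flow and its mirror is what makes the resulting classifier insensitive to the direction of the first observed packet.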
Furthermore, I have proposed and demonstrated a novel approach that uses unsupervised ML
clustering techniques to choose appropriate, representative sub-flows from which a classifier
may be trained. This extension is significant for deploying SSP-ACT to classify new
applications of interest: it eliminates the need for expert knowledge of the application and
removes the complexity of manually choosing the best combination of sub-flows to train a
classifier. This approach has been demonstrated using the EM algorithm.
In Chapter 6 I showed that manual selection of sub-flows for training was not necessary in
general cases. Instead, sub-flows automatically selected by the EM algorithm could produce
classifiers with slightly better Precision and Recall, and minor trade-offs in classification speed
and computational complexity. I further showed that the clustering process for optimal sub-flow
selection can be up to 99% faster, yet still result in acceptable SSP-ACT classifier performance,
when sub-flow selection is performed on small samples of sub-flow instances (for example, 50
instances per sub-flow).
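As a toy illustration of the clustering step, the sketch below fits a one-dimensional, k-component Gaussian mixture by EM. It is a deliberately simplified stand-in: the actual sub-flow selection clusters multivariate feature vectors, and the initialisation scheme here (evenly spaced order statistics, for determinism) is my own choice, not the thesis's:

```python
import math
import statistics

def em_1d(xs, k=2, iters=100):
    """Fit a k-component 1-D Gaussian mixture by EM.

    Returns (weights, means, variances).
    """
    srt = sorted(xs)
    # Deterministic init: means on evenly spaced order statistics.
    if k > 1:
        mu = [srt[(j * (len(xs) - 1)) // (k - 1)] for j in range(k)]
    else:
        mu = [srt[len(xs) // 2]]
    var = [max(1e-6, statistics.pvariance(xs))] * k
    w = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in xs:
            dens = [w[j] / math.sqrt(2 * math.pi * var[j])
                    * math.exp(-(x - mu[j]) ** 2 / (2 * var[j]))
                    for j in range(k)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate weights, means and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            w[j] = nj / len(xs)
            mu[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var[j] = max(1e-6, sum(r[j] * (x - mu[j]) ** 2
                                   for r, x in zip(resp, xs)) / nj)
    return w, mu, var
```

Once the mixture is fitted, each sub-flow instance can be assigned to its most likely component, and sub-flows whose instances concentrate in distinct components are the natural candidates for training.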
I have also briefly explored the impact of packet loss on the Naive Bayes and C4.5 Decision
Tree classifiers trained using SSP-ACT. As presented in Chapter 8, with 5% random, independent
synthetic packet loss, both classifiers maintained their
performance well. For ET traffic, a 5% packet loss only degraded Recall and Precision of both
classifiers by less than 0.5%. For VoIP traffic, a 5% packet loss did not produce any notice-
able degradation on the Naive Bayes classifier’s Recall and Precision. However, it degraded the
C4.5 Decision Tree classifier’s Recall and Precision by 8.5% and 0.1% respectively. Despite
this degradation, median Recall and Precision of the C4.5 Decision Tree classifier remained
above 87% and 99% respectively for all tested positions of the sliding window. Deeper
investigation into the sensitivity of the Naive Bayes and C4.5 Decision Tree classifiers to
packet loss, with other loss rates and loss models, is left for future research.

[2] ET flows made up from 11.9% to 17.1% of the traffic mix in the test datasets; VoIP flows made up from 1.1% to 2.4%.
Finally, I have discussed the pros and cons of using a common classifier (Option A) versus
multiple classifiers in parallel (Option B) to identify multiple applications simultaneously.
My initial study, presented in Chapter 8, investigated the concurrent classification of ET,
VoIP and other traffic. I showed that Option A produced better Precision and Recall for both
the Naive Bayes and C4.5 Decision Tree classifiers. Furthermore, this option does not carry
Option B’s potential for conflicting classification results, and allows the classifier to be
updated more easily. Yet Option A does come at a significant cost in classification speed and
computational complexity. For example, Option A required more than 60% additional time to
build a C4.5 Decision Tree classifier, and the resulting classifier was 60% slower in
classification speed than under Option B. Using a common classifier for all applications also
makes it more difficult to optimise performance for individual applications. This initial
work can serve as a starting point for future investigation of these issues.
My work can be extended in a number of future research directions. These include:
• Characterising the optimal sliding classification window size (N) for a wider range of
applications (as discussed in sections 5.5 and 8.5);
• Identifying how varying N affects classification accuracy, classification timeliness, clas-
sification speed, and the stability of results for continuous classification (as discussed in
section 5.5);
• Evaluating the stability of classification accuracy in the presence of network perturba-
tions, such as packet loss, delay and packet re-ordering (as briefly outlined in section
8.3);
• Evaluating the impact on classification accuracy of different traffic mixes and the em-
ployment of different ML classification and clustering algorithms (as briefly mentioned
in sections 5.3.4, 5.5, and E.1);
• Expanding SSP-ACT for the recognition of new and unknown applications (as discussed
in section 6.6); and
• Evaluating the scalability of SSP-ACT to classify a large number (for example, hundreds)
of applications simultaneously (as discussed in section 8.5).
In summary, my thesis has opened a new path for research on optimising the use of ML
classifiers in real-time IP traffic classification. I believe my proposal will assist the
adoption of ML algorithms inside practical, deployable IP traffic classifiers.
Bibliography
[1] B. Leiner, V. G. Cerf, D. Clark, R. E. Kahn, L. Kleinrock, D. Lynch, J. Postel, L. G.
Roberts, and S. Wolff, “The past and future history of the Internet,” Communications of
the ACM, vol. 40, no. 2, pp. 102–108, 1997.
[2] (2006, January) Worldwide Internet users top 1 billion in 2005. eTForecasts. [Online].
Available: http://www.etforecasts.com/pr/pr106.htm [Last accessed: 2009, 22 February].
[3] J. Zhu and E. Wang, “Diffusion, use, and effect of the Internet in China,” Communica-
tions of the ACM, vol. 48, no. 4, pp. 49–53, 2005.
[4] G. Huston. (2005, March) IPv4 Address utilization. [Online]. Available: http:
//www.potaroo.net/papers/2005-03-ipv4.pdf [Last accessed: 2009, 22 February].
[5] G. Huston. (2005, June) The BGP report for 2005. [Online]. Available: http:
//www.potaroo.net/ispcol/2006-06/bgpupds.html [Last accessed: 2009, 22 February].
[6] M. Fomenkov, K. Keys, D. Moore, and K. Claffy, “Longitudinal study of Internet traffic
in 1998-2003,” in WISICT ’04: Proceedings of the winter international symposium on
Information and communication technologies. Cancun, Mexico: Trinity College Dublin,
2004, pp. 1–6.
[7] R. Kraut, T. Mukhopadhyay, J. Szczypula, S. Kiesler, and W. Scherlis, “Information and
communication: Alternative uses of the Internet in households,” Information Systems
Research, vol. 10, no. 4, pp. 287–303, 1999.
[8] S. E. Stern, S. Gregor, M. A. Martin, S. Goode, and J. Rolfe, “A classification tree anal-
ysis of broadband adoption in Australian households,” in ICEC ’04: Proceedings of the
6th international conference on Electronic commerce. Delft, The Netherlands: ACM
Press, October 2004, pp. 451–456.
[9] E. Castronova, “Network technology, markets, and the growth of synthetic worlds,” in
NetGames ’03: Proceedings of the 2nd workshop on Network and system support for
games. Redwood City, California: ACM Press, 2003, pp. 121–134.
[10] L. Chen, M. L. Gillenson, and D. L. Sherrell, “Consumer acceptance of virtual stores:
a theoretical model and critical success factors for virtual stores,” SIGMIS Database,
vol. 35, no. 2, pp. 8–31, 2004.
[11] Y. Chen and J. H. Rankin, “A framework for benchmarking e-procurement in the AEC
industry,” in ICEC 06: Proceedings of the 8th international conference on Electronic
commerce. Fredericton, New Brunswick, Canada: ACM, August 2006, pp. 411–419.
[12] K. Kim and B. Prabhakar, “Initial trust and the adoption of B2C e-commerce: The case
of Internet banking,” SIGMIS Database, vol. 35, no. 2, pp. 50–64, 2004.
[13] S. Bolin, “E-commerce: a market analysis and prognostication,” StandardView, vol. 6,
no. 3, pp. 97–105, 1998.
[14] J. Yang and G. Miao, “The estimates and forecasts of worldwide e-commerce,” in ICEC
’05: Proceedings of the 7th international conference on Electronic commerce. Xi’an,
China: ACM, 2005, pp. 52–56.
[15] A. Ginsberg, P. Hodge, T. Lindstrom, B. Sampieri, and D. Shiau, “The little Web school-
house: using virtual rooms to create a multimedia distance learning environment,” in
MULTIMEDIA 98: Proceedings of the sixth ACM international conference on Multime-
dia. Bristol, United Kingdom: ACM, September 1998, pp. 89–98.
[16] E. McLoughlin, D. O’Sullivan, M. Bertolotto, and D. C. Wilson, “MEDIC: Mobile di-
agnosis for improved care,” in SAC 06: Proceedings of the 2006 ACM symposium on
Applied computing. Dijon, France: ACM Press, April 2006, pp. 204–208.
[17] I. Tomkos and A. Tzanakaki, “Towards digital optical networks,” in Proceedings of 7th
International Conference on Transparent Optical Networks, 2005, vol. 1, Barcelona,
Spain, July 2005, pp. 1–4.
[18] R. Alferness, “The all-optical networks,” in International Conference on Communication
Technology Proceedings (WCC - ICCT 2000), 2000., vol. 1, Beijing, China, August 2000,
pp. 14–15.
[19] (2006, July) Service provider Quality-of-Service overview. Cisco. [Online]. Available:
http://www.cisco.com/warp/public/cc/so/neso/sqso/spqos wp.pdf [Last accessed: 2009,
22 February].
[20] A. Bouch, A. Kuchinsky, and N. Bhatti, “Quality is in the eye of the beholder: meet-
ing users’ requirements for Internet quality of service,” in CHI ’00: Proceedings of the
SIGCHI conference on Human factors in computing systems. The Hague, The Nether-
lands: ACM Press, April 2000, pp. 297–304.
[21] ITU-T, “ITU-T Recommendation G.114: One-way transmission time,” ITU-T G.114
Standard, International Telecommunication Union, 1996.
[22] G. Armitage, “An experimental estimation of latency sensitivity in multiplayer Quake3,”
in The 11th IEEE International Conference on Networks (ICON2003), 2003., Sydney,
Australia, September 2003, pp. 137–141.
[23] J. Nichols and M. Claypool, “The effects of latency on online madden NFL football,”
in NOSSDAV 04: Proceedings of the 14th international workshop on Network and oper-
ating systems support for digital audio and video. New York, NY, USA: ACM Press,
2004, pp. 146–151.
[24] M. Claypool and J. Tanner, “The effects of jitter on the perceptual quality of video,”
in MULTIMEDIA ’99: Proceedings of the seventh ACM international conference on
Multimedia (Part 2). Orlando, Florida, United States: ACM Press, September 1999, pp.
115–118.
[25] G. Armitage and L. Stewart, “Limitations of using real-world, public servers to estimate
jitter tolerance of first person shooter games,” in ACE ’04: Proceedings of the 2004 ACM
SIGCHI International Conference on Advances in computer entertainment technology.
Singapore: ACM Press, June 2004, pp. 257–262.
[26] T. Henderson and S. Bhatti, “Networked games: a QoS-sensitive application for QoS-
insensitive users?” in RIPQoS ’03: Proceedings of the ACM SIGCOMM workshop on
Revisiting IP QoS. Karlsruhe, Germany: ACM Press, 2003, pp. 141–147.
[27] G. Armitage and L. Stewart, “Some thoughts on emulating jitter for user experience
trials,” in NetGames 04: Proceedings of 3rd ACM SIGCOMM workshop on Network
and system support for games. Portland, Oregon, USA: ACM Press, August 2004, pp.
157–160.
[28] S. Zander and G. Armitage, “Empirically measuring the QoS sensitivity of interactive
online game players,” in Proceedings of Australian Telecommunications and Network
Application Conference (ATNAC), December 2004.
[29] M. Dick, O. Wellnitz, and L. Wolf, “Analysis of factors affecting players’ performance
and perception in multiplayer games,” in NetGames ’05: Proceedings of 4th ACM SIG-
COMM workshop on Network and system support for games. Hawthorne, NY: ACM,
October 2005, pp. 1–7.
[30] R. Braden, D. Clark, and S. Shenker, “Integrated Services in the Internet architecture: an
overview,” RFC 1633, IETF, 1994.
[31] S. Blake, D. Black, M. Carlson, E. Davies, Z. Wang, and W. Weiss, “An architecture for
Differentiated Services,” RFC 2475, IETF, 1998.
[32] E. Rosen, A. Viswanathan, and R. Callon, “Multiprotocol Label Switching Architecture,”
RFC 3031, IETF, 2001.
[33] L. Stewart, G. Armitage, P. Branch, and S. Zander, “An architecture for automated net-
work control of QoS over consumer broadband links,” in IEEE TENCON 05, Melbourne,
Australia, November 2005.
[34] M. Oliveira and T. Henderson, “What online gamers really think of the Internet?” in
NetGames ’03: Proceedings of the 2nd workshop on Network and system support for
games. New York, NY, USA: ACM Press, 2003.
[35] J. But, N. Williams, S. Zander, L. Stewart, and G. Armitage, “ANGEL - Automated net-
work games enhancement layer,” in NetGames ’06: Proceedings of 5th ACM SIGCOMM
workshop on Network and system support for games. Singapore: ACM, October 2006,
p. 9.
[36] T. Nguyen and G. Armitage, “Evaluating Internet pricing schemes - A three dimensional
visual model,” ETRI Journal, vol. 27, no. 1, pp. 64–74, February 2005.
[37] T. Nguyen and G. Armitage, “Pricing the Internet - A visual 3-dimensional evaluation
model,” in Australian Telecommunications Networks & Applications Conference 2003
(ATNAC 2003), Melbourne, Australia, December 2003.
[38] M. Karsten, J. Schmitt, C. L. Wolf, and R. Steinmetz, “Cost and price calculation for
Internet Integrated Services,” in Kommunikation in Verteilten Systemen. Springer, 1999,
pp. 46–57.
[39] J. R. Edell and P. P. Varaiya, “Providing Internet access: What we learn from INDEX,”
IEEE Network, vol. 13, no. 5, pp. 18–25, September/October 1999.
[40] B. Stiller, T. Braun, M. Gunter, and B. Plattner, “The CATI Project: Charging and ac-
counting technology for the Internet,” in ECMAST ’99: Proceedings of the 4th European
Conference on Multimedia Applications, Services and Techniques. Springer-Verlag,
1999, pp. 281–296.
[41] J. Frank, “Machine learning and intrusion detection: Current and future directions,” in
Proceedings of the 17th National Computer Security Conference, Baltimore, MD, Octo-
ber 1994.
[42] Bro Intrusion Detection System – Bro Overview, Lawrence Berkeley National
Laboratory, April 2006. [Online]. Available: http://bro-ids.org. [Last accessed: 2009, 22
February].
[43] P. Branch, “Lawful Interception of the Internet,” Australian Journal of Emerging Tech-
nologies and Society, 2003.
[44] A. Milanovic, S. Srbljic, I. Raznjevic, D. Sladden, I. Matosevic, and D. Skrobo, “Meth-
ods for Lawful Interception in IP telephony networks based on H.323,” in Computer as
a Tool. The IEEE Region 8. EUROCON 2003, vol. 1, September 2003, pp. 198–202.
[45] A. Rojas and P. Branch, “Lawful Interception based on sniffers in Next Generation Net-
works,” in Australian Telecommunications Networks & Applications Conference 2004
(ATNAC2004), Sydney, Australia, December 8-10 2004.
[46] P. Branch, A. Pavlicic, and G. Armitage, “Using MAC addresses in the Lawful Intercep-
tion of IP traffic,” in Australian Telecommunications Networks & Applications Confer-
ence 2004 (ATNAC2004), Sydney, Australia, December 2004.
[47] Wolfenstein Enemy Territory, February 2009. [Online]. Available: http://enemy-territory.
4players.de:1041/news.php [Last accessed: 2005, December].
[48] Snort – the de facto standard for intrusion detection/prevention, Sourcefire, Inc., April
2006. [Online]. Available: http://www.snort.org [Last accessed: 2009, 22 February].
[49] V. Paxson, “Bro: A system for detecting network intruders in real-time,” Computer Net-
works, no. 31 (23-24), pp. 2435–2463, 1999.
[50] F. Baker, B. Foster, and C. Sharp, “Cisco architecture for Lawful Intercept in IP net-
works,” RFC 3924, Internet Engineering Task Force IETF, October 2004.
[51] T. Karagiannis, A. Broido, N. Brownlee, and K. Claffy, “Is P2P dying or just hiding?” in
IEEE Global Telecommunications Conference (GLOBECOM ’04), 2004., vol. 3, Dallas,
Texas, USA, November/December 2004, pp. 1532–1538.
[52] S. Sen, O. Spatscheck, and D. Wang, “Accurate, scalable in-network identification of P2P
traffic using application signatures,” in WWW ’04: Proceedings of the 13th international
conference on World Wide Web. New York, NY, USA: ACM, May 2004, pp. 512–521.
[53] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: multilevel traffic classifica-
tion in the dark,” in SIGCOMM ’05: Proceedings of the 2005 conference on Applications,
technologies, architectures, and protocols for computer communications. Philadelphia,
Pennsylvania, USA: ACM, August 2005, pp. 229–240.
[54] D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, and P. Tofanelli, “Revealing Skype traffic:
when randomness plays with you,” in SIGCOMM ’07: Proceedings of the 2007 confer-
ence on Applications, technologies, architectures, and protocols for computer communi-
cations. Kyoto, Japan: ACM, August 2007, pp. 37–48.
[55] K. Papagiannaki, N. Taft, S. Bhattacharyya, P. Thiran, K. Salamatian, and C. Diot, “A
pragmatic definition of elephants in Internet backbone traffic,” in IMW ’02: Proceedings
of the 2nd ACM SIGCOMM Workshop on Internet measurement. Marseille, France:
ACM, 2002, pp. 175–176.
[56] N. Brownlee and K. Claffy, “Understanding Internet traffic streams: Dragonflies and
tortoises,” IEEE Communications Magazine, vol. 40, no. 10, pp. 110–117, 2002.
[57] S. Sarvotham, R. Riedi, and R. Baraniuk, “Connection-level analysis and modeling of
network traffic,” in IMW ’01: Proceedings of the 1st ACM SIGCOMM Workshop on
Internet Measurement. San Francisco, California, USA: ACM, 2001, pp. 99–103.
[58] A. Soule, K. Salamatian, N. Taft, R. Emilion, and K. Papagiannaki, “Flow classification
by histograms or how to go on safari in the Internet,” in ACM SIGMETRICS Performance
Evaluation Review, vol. 32, no. 1. New York, NY, USA: ACM, 2004, pp. 49–60.
[59] A. McGregor, M. Hall, P. Lorier, and J. Brunskill, “Flow clustering using machine learn-
ing techniques,” in Passive and Active Measurement (PAM) Conference, 2004, Antibes
Juan-les-Pins, France, April 2004.
[60] S. Zander, T. Nguyen, and G. Armitage, “Automated traffic classification and application
identification using machine learning,” in IEEE 30th Conference on Local Computer
Networks (LCN 2005), Sydney, Australia, November 2005, pp. 250–257.
[61] M. Roughan, S. Sen, O. Spatscheck, and N. Duffield, “Class-of-service mapping for
QoS: a statistical signature-based approach to IP traffic classification,” in IMC ’04: Pro-
ceedings of the 4th ACM SIGCOMM conference on Internet measurement. Taormina,
Sicily, Italy: ACM, October 2004, pp. 135–148.
[62] B. Choi, S. Moon, Z. Zhang, K. Papagiannaki, and C. Diot, “Analysis of point-to-point
packet delay in an operational network,” in INFOCOM 2004. Twenty-third Annual Joint
Conference of the IEEE Computer and Communications Societies, Hong Kong, March
2004, pp. 1797–1807.
[63] T. Nguyen and G. Armitage, “Experimentally derived interactions between TCP traffic
and service quality over DOCSIS cable links,” in IEEE Global Telecommunications Con-
ference (GLOBECOM ’04), 2004., vol. 3, Texas, USA, November/December 2004, pp.
1314–1318.
[64] T. Nguyen and G. Armitage, “Quantitative assessment of IP service quality in 802.11b
and DOCSIS networks,” in Proceedings of Australian Telecommunications Networks &
Applications Conference 2004 (ATNAC 2004), Sydney, Australia, December 2004.
[65] T. Nguyen and G. Armitage, “Quantitative assessment of IP service quality in 802.11b
networks,” in Telecommunications and Signal Processing (WITSP’04), Adelaide, Aus-
tralia, December 2004.
[66] CableLabs, “Data-Over-Cable Service Interface Specifications Radio Frequency Inter-
face Specification,” SP-RFIv1.1-I01-990311, 1999.
[67] IEEE, “ANSI/IEEE Std 802.11, 1999 edition,” ISO/IEC 8802-11: 1999, 1999.
[68] R. Braden, D. Clark, and S. Shenker, “Resource Reservation Protocol (RSVP) version 1
functional aspects,” RFC 2205, IETF, 1997.
[69] G. Armitage, Quality of Service In IP Networks: Foundations for a Multi-Service Inter-
net. Macmillan Technical Publishing, 2000.
[70] StreamEngine, Ubicom, June 2008. [Online]. Available: http://streamengine.ubicom.
com/ [Last accessed: 2009, 22 February].
[71] (2008, June) D-link GamerLounge - Product Categories. D-Link. [Online]. Available:
http://games.dlink.com/products/?pid=370&#DGL-4300 [Last accessed: 2009, 22
February].
[72] (2008, June) AutoQoS for Voice over IP (voip). Cisco White Paper. [Online]. Available:
http://www.cisco.com/warp/public/732/Tech/qos/docs/autoqos wp.pdf. [Last accessed:
2008, 28 June].
[73] (2008, June) Solutions: Service control. Allot Communications. [On-
line]. Available: http://www.allot.com/index.php?option=com content&task=view&id=
51&Itemid=51 [Last accessed: 2008, June].
[74] (2008) Exinda - WAN Optimization, WAN Acceleration, Application Acceleration,
Unified Performance Management. Exinda Networks. [Online]. Available: http:
//www.exinda.com/public/products/products.html. [Last accessed: 2009, 22 February].
[75] NetIntact, NetIntact, 2008. [Online]. Available: http://www.netintact.com/ [Last
accessed: 2008, June].
[76] J. But, T. Nguyen, L. Stewart, N. Williams, and G. Armitage, “Performance analysis of
the ANGEL system for automated control of game traffic prioritisation,” in NetGames
’07: Proceedings of the 6th ACM SIGCOMM workshop on Network and system support
for games. Melbourne, Australia: ACM, September 2007, pp. 123–128.
[77] (2008, June) How StreamEngine works. Ubicom. [Online]. Available: http:
//streamengine.ubicom.com/html/activity.cfm?page=how streamengine works [Last ac-
cessed: 2009, 22 February].
[78] L. Burgstahler, K. Dolzer, C. Hauser, J. Jahnert, S. Junghans, C. Macian, and W. Payer,
“Beyond technology: the missing pieces for QoS success,” in RIPQoS ’03: Proceedings
of the ACM SIGCOMM workshop on Revisiting IP QoS. Karlsruhe, Germany: ACM
Press, August 2003, pp. 121–130.
[79] J. K. MacKie-Mason and H. R. Varian, “Pricing the Internet,” EconWPA, Computational
Economics 9401002, January 1994. [Online]. Available: http://ideas.repec.org/p/wpa/
wuwpco/9401002.html [Last accessed: 2009, 22 February].
[80] F. P. Kelly, “Charging and accounting for bursty connections,” Internet economics, pp.
253–278, 1997.
[81] F. Kelly, “Charging and rate control for elastic traffic,” European Transactions on
Telecommunications, vol. 8, pp. 33–37, 1997.
[82] S. Shenker, D. Clark, D. Estrin, and S. Herzog, “Pricing in computer networks: reshaping
the research agenda,” ACM SIGCOMM Computer Communication Review, vol. 26, no. 2,
pp. 19–43, April 1996.
[83] N. Keon and G. Anandalingam. (2003, July) A new pricing
model for competitive telecommunications services using congestion dis-
counts. [Online]. Available: http://mail3.rhsmith.umd.edu/Faculty/KM/papers.nsf\
/0/d5ea3f525a84fc5485256d0c006f210d?OpenDocument [Last accessed: 2009, 22
February].
[84] D. Clark, “Combining sender and receiver payments in the Internet,” in Telecommunica-
tions Research Policy Conference, 1996.
[85] M. Odlyzko, “Paris metro pricing for the Internet,” in EC ’99: Proceedings of the 1st
ACM conference on Electronic commerce, Denver, Colorado, United States, 1999, pp.
140–147.
[86] P. Dube, V. Borkar, and D. Manjunath, “Differential join prices for parallel queues: so-
cial optimality, dynamic pricing algorithms and application to internet pricing,” in IN-
FOCOM 2002. Twenty-First Annual Joint Conference of the IEEE Computer and Com-
munications Societies. Proceedings. IEEE, vol. 1, 2002, pp. 276–283.
[87] P. Marbach, “Priority service and max-min fairness,” in IEEE INFOCOM 2002, The 21st
Annual Joint Conference of the IEEE Computer and Communications Societies, vol. 1,
New York, USA, 2002, pp. 266–275.
[88] P. Marbach, “Priority service and max-min fairness,” IEEE/ACM Transactions on Net-
working, vol. 11, no. 5, pp. 733–746, October 2003.
[89] R. Cocchi, D. Estrin, S. Shenker, and L. Zhang, “A study of priority pricing in multiple
service class networks,” in SIGCOMM ’91: Proceedings of the conference on Communi-
cations architecture protocols. USA: ACM Press, 1991, pp. 123–130.
[90] G. Fankhauser and B. Plattner, “Diffserv bandwidth brokers as mini-markets,” in Pro-
ceedings of International Workshop on Internet Service Quality Economics (ICQE), MIT,
US, December 2-3 1999.
[91] X. Wang and H. Schulzrinne, “RNAP: A resource negotiation and pricing protocol,”
in Proceedings of the Ninth International Workshop on Network and Operating Systems
Support for Digital Audio and Video NOSSDAV ’99, Basking Ridge, NJ, June 1999, pp.
77–93.
[92] M. Yuksel, S. Kalyanaraman, and A. Goel, “Congestion pricing overlaid on edge-to-
edge congestion control,” IEEE International Conference on Communications (ICC ’03),
2003., vol. 2, pp. 880–884, May 2003.
[93] A. J. O’Donnell and H. Sethu, “Congestion control, Differentiated Services, and efficient
capacity management through a novel pricing strategy,” Computer Communications,
vol. 26, no. 13, pp. 1457–1469, 2003.
[94] B. Stiller, P. Reichl, and S. Leinen, “Pricing and cost recovery for Internet Services:
Practical review, classification and application of relevant models.” in NETNOMICS -
Economic Research and Electronic Networking, vol. 3. Kluwer Academic Publishers,
2001, pp. 149–171.
[95] “Bills Digest no.67 1997-98, Telecommunications Legislation Amendment Bill 1997,”
Parliament of Australia, 1997.
[96] B. Karpagavinayagam, R. State, and O. Festor, “Monitoring architecture for Lawful In-
terception in VoIP networks,” in Second International Conference on Internet Monitoring
and Protection (ICIMP 2007), San Jose, CA, July 2007.
[97] A. Rojas, P. Branch, and G. Armitage, “Predictive Lawful Interception in mobile IPv6
networks,” in ICON 2007: 15th IEEE International Conference on Networks, 2007.,
Adelaide, Australia, November 2007, pp. 501–506.
[98] A. Moore and D. Zuev, “Internet traffic classification using Bayesian analysis tech-
niques,” in SIGMETRICS ’05: Proceedings of the 2005 ACM SIGMETRICS interna-
tional conference on Measurement and modeling of computer systems. Banff, Alberta,
Canada: ACM, June 2005, pp. 50–60.
[99] J. Erman, A. Mahanti, and M. Arlitt, “Byte me: a case for byte accuracy in traffic clas-
sification,” in MineNet ’07: Proceedings of the 3rd annual ACM workshop on Mining
network data. San Diego, California, USA: ACM Press, June 2007, pp. 35–38.
[100] (2007, August) Internet Assigned Numbers Authority (IANA). [Online]. Available:
http://www.iana.org/assignments/port-numbers [Last accessed: 2009, 22 February].
[101] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “RTP: A transport protocol for
real-time applications,” RFC 1889, IETF, 1996.
[102] A. Moore and K. Papagiannaki, “Toward the accurate identification of network applica-
tions,” in Sixth Passive and Active Measurement Workshop (PAM), Boston, MA, USA,
March/April 2005.
[103] A. Madhukar and C. Williamson, “A longitudinal study of P2P traffic classification,” in
14th IEEE International Symposium on Modeling, Analysis, and Simulation of Computer
and Telecommunication Systems, September 2006, pp. 179 –188.
[104] V. Paxson, “Empirically derived analytic models of wide-area TCP connections,”
IEEE/ACM Transactions on Networking, vol. 2, no. 4, pp. 316–336, 1994.
[105] C. Dewes, A. Wichmann, and A. Feldmann, “An analysis of Internet chat systems,” in
IMC ’03: Proceedings of the 3rd ACM SIGCOMM conference on Internet measurement.
Miami, Florida, USA: ACM, October 2003, pp. 51–64.
[106] K. C. Claffy, “Internet traffic characterisation,” PhD Thesis, University of California, San
Diego, 1994.
[107] T. Lang, G. Armitage, P. Branch, and H.-Y. Choo, “A synthetic traffic model for Half-
Life,” in Proceedings of Australian Telecommunications Networks & Applications Con-
ference 2003 ATNAC2003, Melbourne, Australia, December 2003.
[108] T. Lang, P. Branch, and G. Armitage, “A synthetic traffic model for Quake 3,” in Pro-
ceedings of ACM SIGCHI ACE2004, Singapore, June 2004.
[109] I. Witten and E. Frank, Data mining: Practical machine learning tools and techniques
with Java implementations, 2nd ed. Morgan Kaufmann Publishers, 2005.
[110] Z. Shi, Principles of machine learning. International Academic Publishers, 1992.
[111] H. A. Simon, “Why should machines learn?” in R. S. Michalski, J. G. Carbonell, and
T. M. Mitchell (eds.), Machine Learning: An Artificial Intelligence Approach. Tioga, 1983.
[112] B. Silver, “Netman: A learning network traffic controller,” in IEA/AIE ’90: Proceedings
of the 3rd international conference on Industrial and engineering applications of artifi-
cial intelligence and expert systems. Charleston, South Carolina, United States: ACM,
1990, pp. 923–931.
[113] Y. Reich and S. J. Fenves, “The formation and use of abstract concepts in design,” Con-
cept formation knowledge and experience in unsupervised learning, pp. 323–353, 1991.
[114] R. Kohavi, J. R. Quinlan, W. Klosgen, and J. Zytkow, “Decision tree discovery,” Hand-
book of Data Mining and Knowledge Discovery, pp. 267–276, 2002.
[115] G. John and P. Langley, “Estimating continuous distributions in Bayesian classifiers,” in
Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Mon-
treal, Quebec, Canada: Morgan Kaufmann, August 1995, pp. 338–345.
[116] E. B. Hunt, J. Marin, and P. J. Stone, Experiments in Induction. New York, NY, USA:
Academic Press, 1966.
[117] J. Han and M. Kamber, Data Mining - Concepts and Techniques. Morgan Kaufmann
Publishers, 2001.
[118] D. Gamberger, T. Smuc, and I. Maric. (2006, April) Tutorial on decision tree. [Online].
Available: http://dms.irb.hr/tutorial/tut dtrees.php [Last accessed: 2009, 22 February].
[119] R.-H. Li and G. G. Belford, “Instability of decision tree classification algorithms,” in
KDD ’02: Proceedings of the eighth ACM SIGKDD international conference on Knowl-
edge discovery and data mining. New York, NY, USA: ACM Press, 2002, pp. 570–575.
[120] D. J. Haglin. (2006, April) Decision trees for supervised learning. [Online]. Available:
http://grb.mnsu.edu/grbts/doc/manual/J48 Decision Trees.html [Last accessed: 2009, 22
February].
[121] H. D. Fisher, J. M. Pazzani, and P. Langley, Concept Formation: Knowledge and Ex-
perience in Unsupervised Learning. San Francisco, CA, USA: Morgan Kaufmann
Publishers, 1991.
[122] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, “Traffic classifica-
tion on the fly,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 2, pp.
23–26, 2006.
[123] J. Erman, A. Mahanti, M. Arlitt, and C. Williamson, “Identifying and discriminating
between web and peer-to-peer traffic in the network core,” in WWW ’07: Proceedings of
the 16th international conference on World Wide Web. Banff, Alberta, Canada: ACM
Press, May 2007, pp. 883–892.
[124] (2006, April) Weka en:primer (3.4.6). The University of Waikato. [Online]. Available:
http://weka.sourceforge.net/wekadoc/index.php/en:Primer [Last accessed: 2009, 22
February].
[125] (2009, February) WEKA API documentation (weka.clusterers class EM). [Online].
Available: http://weka.sourceforge.net/doc/ [Last accessed: 2009, 22 February].
[126] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. Wiley-Interscience,
2000.
[127] O. Carmichael and M. Hebert, “Shape-based recognition of wiry objects,” IEEE Trans-
actions on Pattern Analysis and Machine Intelligence, vol. 26, no. 12, pp. 1537–1552,
December 2004.
[128] (2007, November) [Wekalist] ten-fold cross validation (1). [Online]. Available: https://
list.scms.waikato.ac.nz/mailman/htdig/wekalist/2005-April/003836.html [Last accessed:
2007, 30 November].
[129] (2007, November) [wekalist] ten-fold cross validation (2). [Online]. Available: https://
list.scms.waikato.ac.nz/mailman/htdig/wekalist/2005-April/003847.html [Last accessed:
2007, 30 November].
[130] W. Rand, “Objective criteria for the evaluation of clustering methods,” Journal of the
American Statistical Association, vol. 66, no. 336, pp. 846–850, 1971.
[131] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Cluster validity methods: part I,” ACM
SIGMOD Record, vol. 31, no. 2, pp. 40–45, 2002.
[132] R. Xu and D. Wunsch, “Survey of clustering algorithms,” IEEE Transactions on Neural
Networks, vol. 16, no. 3, pp. 645–678, May 2005.
[133] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “Clustering validity checking methods:
part II,” ACM SIGMOD Record, vol. 31, no. 3, pp. 19–27, 2002.
[134] M. Hall and G. Holmes, “Benchmarking attribute selection techniques for discrete class
data mining,” IEEE Transactions on Knowledge and Data Engineering, vol. 15, no. 6,
pp. 1437–1447, November/December 2003.
[135] D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning.
Boston, MA, USA: Addison-Wesley Longman Publishing Co., Inc., 1989.
[136] R. Kohavi and G. H. John, “Wrappers for feature subset selection,” Artificial Intelligence,
vol. 97, no. 1-2, pp. 273–324, 1997.
[137] P. H. Winston, Artificial Intelligence, 2nd ed. Boston, MA, USA: Addison-Wesley
Longman Publishing Co., Inc., 1984.
[138] J. Zhang and I. Mani, “kNN approach to unbalanced data distributions: A case study
involving information extraction,” in Proceedings of the ICML’03 Workshop on Learning
from Imbalanced Data Sets, Washington, DC, 2003.
[139] S. Visa and A. Ralescu, “Issues in mining imbalanced data sets - a review paper,” in Pro-
ceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference,
MAICS-2005, 2005, pp. 67–73.
[140] C. X. Ling and C. Li, “Data mining for direct marketing: Problems and solutions,” in
Knowledge Discovery and Data Mining. AAAI Press, 1998, pp. 73–79.
[141] M. Kubat and S. Matwin, “Addressing the curse of imbalanced training sets: one-sided
selection,” in Proceedings of the Fourteenth International Conference on Machine Learn-
ing. Morgan Kaufmann, 1997, pp. 179–186.
[142] P. Domingos, “Metacost: a general method for making classifiers cost-sensitive,” in KDD
’99: Proceedings of the fifth ACM SIGKDD international conference on Knowledge dis-
covery and data mining. San Diego, California, United States: ACM, 1999, pp. 155–
164.
[143] N. Japkowicz, C. Myers, and M. A. Gluck, “A novelty detection approach to classifica-
tion,” in Proceedings of the Fourteenth International Joint Conference on Artificial
Intelligence, 1995,
pp. 518–523.
[144] A. Y. Liu, “The effect of oversampling and undersampling on classifying imbalanced text
datasets,” Master Thesis, The University of Texas at Austin, 2004.
[145] A. Nickerson, N. Japkowicz, and E. Milios, “Using unsupervised learning to guide re-
sampling in imbalanced data sets,” in Proceedings of the Eighth International Workshop
on Artificial Intelligence and Statistics, 2001, pp. 261–265.
[146] G. M. Weiss and F. Provost, “Learning when training data are costly: The effect of class
distribution on tree induction,” Journal of Artificial Intelligence Research, vol. 19, pp.
315–354, 2003.
[147] S. Visa and A. Ralescu, “The effect of imbalanced data class distribution on fuzzy classi-
fiers - experimental study,” in The 14th IEEE International Conference on Fuzzy Systems
(FUZZ ’05), 2005, May 2005, pp. 749–754.
[148] N. Japkowicz, “Learning from imbalanced data sets: a comparison of various
strategies,” Learning from imbalanced data sets: The AAAI Workshop 10-15.
Menlo Park, CA: AAAI Press, Tech. Rep. WS-00-05, 2000. [Online]. Available:
http://www.aaai.org/Library/Workshops/2000/ws00-05-003.php [Last accessed: 2009,
22 February].
[149] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study,” Intel-
ligent Data Analysis Journal, vol. 6, no. 5, pp. 429–450, 2002.
[150] A. Moore, D. Zuev, and M. Crogan, “Discriminators for use in flow-based classification,”
Department of Computer Science, Queen Mary, University of London, Tech. Rep.
RR-05-13, August 2005. [Online]. Available: http://www.dcs.qmul.ac.uk/tech reports/
RR-05-13.pdf [Last accessed: 2009, 22 February].
[151] N. Williams, S. Zander, and G. Armitage, “Evaluating machine learning methods for
online game traffic identification,” Centre for Advanced Internet Architectures (CAIA),
Tech. Rep. 060410C, April 2006. [Online]. Available: http://caia.swin.edu.au/reports/
060410C/CAIA-TR-060410C.pdf [Last accessed: 2009, 22 February].
[152] J. Park, H.-R. Tyan, and C.-C. J. Kuo, “Internet traffic classification for scalable QoS pro-
vision,” in IEEE International Conference on Multimedia and Expo, 2006, Toronto, On-
tario, Canada, July 2006, pp. 1221–1224.
[153] J. Erman, M. Arlitt, and A. Mahanti, “Traffic classification using clustering algorithms,”
in MineNet ’06: Proceedings of the 2006 SIGCOMM workshop on Mining network data.
Pisa, Italy: ACM, 2006, pp. 281–286.
[154] Y. Yang and R. Kravets, “Throughput guarantees for multi-priority traffic in ad hoc net-
works,” Ad Hoc Networks, vol. 5, pp. 228–253, 2007.
[155] H.-H. W. Lin and X., “Multiple priorities QoS scheduling for simultaneous videos trans-
missions,” in Proceedings. International Symposium on Multimedia Software Engineer-
ing, 2000. IEEE Computer Society, 2000, pp. 135–141.
[156] V. Ambetkar, P. Bender, J. Ma, Y. Pei, and W. J. Modestino, “Distributed flow admission
control for real-time multimedia services over wireless ad hoc networks,” in MobiMedia
’06: Proceedings of the 2nd international conference on Mobile multimedia communi-
cations. Alghero, Italy: ACM, 2006, pp. 1–6.
[157] S. Wang, D. Xuan, and W. Zhao, “Differentiated Services with statistical QoS guar-
antees in static-priority scheduling networks,” Texas A & M University, Tech. Rep.
TR01-015, 2001.
[158] K. Cieliebak and B. Liver, “How many QoS classes are optimal?” in EC ’99: Proceed-
ings of the 1st ACM conference on electronic commerce. Denver, Colorado, United
States: ACM, 1999, pp. 48–57.
[159] T. Nguyen and G. Armitage, “Training on multiple sub-flows to optimise the use of
machine learning classifiers in real-world IP networks,” in Proceedings 2006 31st IEEE
Conference on Local Computer Networks, Tampa, Florida, USA, November 2006, pp.
369–376.
[160] T. Nguyen and G. Armitage, “Synthetic sub-flow pairs for timely and stable IP traffic
identification,” in Proceedings of Australian Telecommunication Networks and Applica-
tion Conference, Melbourne, Australia, December 2006.
[161] A. Dempster, N. Laird, and D. Rubin, “Maximum Likelihood from incomplete data via
the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, pp. 1–22, 1977.
[162] P. Cheeseman and J. Stutz, “Bayesian classification (AutoClass): Theory and results,” in
Advances in Knowledge Discovery and Data Mining. Menlo Park, CA, USA: American
Association for Artificial Intelligence, 1996, pp. 153–180.
[163] C. Schmoll and S. Zander, Netmate, February 2009. [Online]. Available: http:
//sourceforge.net/projects/netmate-meter/ [Last accessed: 2009, 22 February].
[164] (2006, September) Traffic measurement data repository. The National Laboratory for
Applied Network Research (NLANR). [Online]. Available: http://pma.nlanr.net/Special/
[Last accessed: 2009, 22 February].
[165] T. Auld, A. W. Moore, and S. F. Gull, “Bayesian neural networks for Internet traffic clas-
sification,” IEEE Transactions on Neural Networks, vol. 18, no. 1, pp. 223–239, January
2007.
[166] J. Park, H.-R. Tyan, and C.-C. J. Kuo, “GA-based Internet traffic classification technique
for QoS provisioning,” in IIH-MSP ’06: Proceedings of the 2006 International Confer-
ence on Intelligent Information Hiding and Multimedia. Pasadena, California: IEEE
Computer Society, December 2006, pp. 251–254.
[167] M. Crotti, M. Dusi, F. Gringoli, and L. Salgarelli, “Traffic classification through simple
statistical fingerprinting,” ACM SIGCOMM Computer Communication Review, vol. 37,
no. 1, pp. 5–16, 2007.
[168] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, “Semi-supervised net-
work traffic classification,” ACM SIGMETRICS Performance Evaluation Review, vol. 35,
no. 1, pp. 369–370, 2007.
[169] J. Erman, A. Mahanti, M. Arlitt, I. Cohen, and C. Williamson, “Offline/realtime
network traffic classification using semi-supervised learning,” Department of Computer
Science, University of Calgary, Tech. Rep., February 2007. [Online]. Available: http:
//pages.cpsc.ucalgary.ca/∼mahanti/papers/semi.supervised.pdf [Last accessed: 2009, 22
February].
[170] J. Erman, A. Mahanti, and M. Arlitt, “QRP05-4: Internet traffic identification using
machine learning,” in GLOBECOM ’06. IEEE Global Telecommunications Conference,
2006., San Francisco, USA, December 2006, pp. 1–6.
[171] N. Williams, S. Zander, and G. Armitage, “A preliminary performance comparison of
five machine learning algorithms for practical IP traffic flow classification,” SIGCOMM
Computer Communication Review, vol. 36, no. 5, pp. 5–16, 2006.
[172] P. Haffner, S. Sen, O. Spatscheck, and D. Wang, “ACAS: Automated construction of
application signatures,” in MineNet ’05: Proceeding of the 2005 ACM SIGCOMM work-
shop on Mining network data. New York, NY, USA: ACM Press, August 2005, pp.
197–202.
[173] N. Zennström and J. Friis, Skype, Skype Technologies S.A., February 2009. [Online].
Available: http://www.skype.com/intl/en/. [Last accessed: 2009, 22 February].
[174] E. J. Gumbel, Statistics of Extremes. New York: Columbia University Press, 1958.
[175] Weka 3.4.4, The University of Waikato, February 2009. [Online]. Available:
http://www.cs.waikato.ac.nz/ml/weka [Last accessed: 2009, 22 February].
[176] G. Armitage, M. Claypool, and P. Branch, Networking and online games - understanding
and engineering multiplayer Internet games. UK: John Wiley & Sons, 2006.
[177] T. W. Anderson and D. A. Darling, “Asymptotic theory of certain ‘goodness of fit’ criteria
based on stochastic processes,” Annals of Mathematical Statistics, vol. 23, no. 2, pp.
193–212, 1952.
[178] W. J. Conover, Practical nonparametric statistics. New York: John Wiley & Sons,
1971.
[179] (2009, February) CAIA Grangenet game server (GENIUS project). Centre for Advanced
Internet Architectures (CAIA). [Online]. Available: http://caia.swin.edu.au/genius/
games.html [Last accessed: 2009, 22 February].
[180] S. Zander, D. Kennedy, and G. Armitage, “Dissecting server-discovery traffic patterns
generated by multiplayer first person shooter games,” in NetGames ’05: Proceedings of
4th ACM SIGCOMM workshop on Network and system support for games. Hawthorne,
NY: ACM, October 2005, pp. 1–12.
[181] (2006, March) Traffic measurement data repository. The University of Twente. [Online].
Available: http://m2c-a.cs.utwente.nl/repository [Last accessed: 2006, 26 March].
[182] (2007, January) Supercomputing overview. The Centre for Astrophysics and
Supercomputing, Swinburne University of Technology. [Online]. Available: http:
//astronomy.swinburne.edu.au/supercomputing/ [Last accessed: 2009, 22 February].
[183] T. G. Renna, I. Bar-Kana, and P. Kalata, “A two-level gain stochastic disturbance ob-
server with hysteresis,” in IEEE International Conference on Systems Engineering, Au-
gust 1990, pp. 77–80.
[184] Qstat, The Open Group Base Specifications Issue 6, IEEE Std 1003.1, 2004 Edition,
January 2009. [Online]. Available: http://www.opengroup.org/onlinepubs/000095399/
utilities/qstat.html. [Last accessed: 2009, 22 February].
[185] ITU-T, “G.711: Pulse code modulation (PCM) of voice frequencies,” G.711 ITU-T Stan-
dard, International Telecommunication Union, 1988.
[186] ETSI, “European Digital Cellular Telecommunications System (Phase 2): Full rate
speech transcoding. ETSI spec. GSM 06.10, GSM 06.32 ed.” European Standard, The
International Telegraph and Telephone Consultative Committee., 1994.
[187] J. Rosenberg, H. Schulzrinne, G. Camarillo, A. Johnston, J. Peterson, R. Sparks, M. Han-
dley, and E. Schooler, “SIP: Session Initiation Protocol,” RFC 3261, IETF, 2002.
[188] M. Handley and V. Jacobson, “SDP: Session Description Protocol,” RFC 2327, IETF,
1998.
[189] H. Schulzrinne and S. Casner, “RTP profile for audio and video conferences with minimal
control,” RFC 3551, IETF, 2003.
[190] R. Zopf, “Real-time transport protocol (RTP) payload for comfort noise (CN),” RFC
3389, IETF, 2002.
[191] J.-C. Bolot, “End-to-end packet delay and loss behavior in the Internet,” ACM SIGCOMM
Computer Communication Review, vol. 23, no. 4, pp. 289–298, 1993.
[192] W. Jiang and H. Schulzrinne, “Comparison and optimization of packet loss repair meth-
ods on VoIP perceived quality under bursty loss,” in NOSSDAV ’02: Proceedings of the
12th international workshop on Network and operating systems support for digital audio
and video. Miami, Florida, USA: ACM, 2002, pp. 73–81.
[193] D. Loguinov and H. Radha, “Measurement study of low-bitrate Internet video stream-
ing,” in IMW ’01: Proceedings of the 1st ACM SIGCOMM Workshop on Internet Mea-
surement. San Francisco, California, USA: ACM, 2001, pp. 281–293.
[194] M. Borella, D. Swider, S. Uludag, and G. Brewster, “Internet packet loss: Measurement
and implications for end-to-end QoS,” in Proceedings of the 1998 ICPP Workshops on
Architectural and OS Support for Multimedia Applications/Flexible Communication Sys-
tems/Wireless Networks and Mobile Computing. IEEE Computer Society, August 1998,
pp. 3–12.
[195] H. Balakrishnan, V. Padmanabhan, S. Seshan, M. Stemm, and R. Katz, “TCP behavior
of a busy Internet server: Analysis and improvements,” in INFOCOM ’98, Seventeenth
Annual Joint Conference of the IEEE Computer and Communications Societies. San
Francisco, CA, USA: University of California at Berkeley, 1998, pp. 252–262.
[196] V. E. Paxson, “Measurements and analysis of end-to-end Internet dynamics,” PhD Thesis,
University of California at Berkeley, Berkeley, CA, USA, 1998.
[197] M. Yajnik, S. Moon, J. Kurose, and D. Towsley, “Measurement and modelling of the
temporal dependence in packet loss,” in INFOCOM ’99. Eighteenth Annual Joint Con-
ference of the IEEE Computer and Communications Societies, vol. 1, March 1999, pp.
345–352.
[198] M. Dischinger, A. Haeberlen, K. P. Gummadi, and S. Saroiu, “Characterizing residential
broadband networks,” in IMC ’07: Proceedings of the 7th ACM SIGCOMM conference
on Internet measurement. San Diego, California, USA: ACM, 2007, pp. 43–56.
[199] M. Mathis, J. Semke, and J. Mahdavi, “The macroscopic behavior of the TCP conges-
tion avoidance algorithm,” ACM SIGCOMM Computer Communication Review, vol. 27,
no. 3, pp. 67–82, 1997.
[200] L. Cottrell. (2000, February) Throughput versus loss. Stanford Linear Accelerator Center.
[Online]. Available: http://www.slac.stanford.edu/comp/net/wan-mon/thru-vs-loss.html
[Last accessed: 2009, 22 February].
[201] J.-A. Bussiere and S. Zander, “Enemy Territory traffic analysis,” Centre for Advanced
Internet Architectures (CAIA), Tech. Rep. 060203A, February 2006. [Online]. Available:
http://caia.swin.edu.au/reports/060203A/CAIA-TR-060203A.pdf [Last accessed: 2009,
22 February].
[202] J. Ma, K. Levchenko, C. Kreibich, S. Savage, and G. M. Voelker, “Unexpected means
of protocol inference,” in IMC ’06: Proceedings of the 6th ACM SIGCOMM on Internet
measurement. Rio de Janeriro, Brazil: ACM Press, October 2006, pp. 313–326.
[203] C. Jin, H. Wang, and K. G. Shin, “Hop-count filtering: an effective defense against
spoofed DDoS traffic,” in CCS ’03: Proceedings of the 10th ACM conference on Com-
puter and communications security. Washington D.C., USA: ACM, 2003, pp. 30–41.
[204] G. Armitage, C. Javier, and S. Zander, “Post-game estimation of game client RTT
and hop count distributions,” Centre for Advanced Internet Architectures (CAIA),
Tech. Rep. 060801A, August 2006. [Online]. Available: http://caia.swin.edu.au/reports/
060801A/CAIA-TR-060801A.pdf [Last accessed: 2009, 22 February].
[205] Wireshark, Wireshark foundation, February 2009. [Online]. Available: http:
//www.wireshark.org/ [Last accessed: 2009, 22 February].
[206] Wireshark. (2008, December) Wireshark frequently asked questions. [Online]. Available:
http://www.wireshark.org/faq.html [Last accessed: 2009, 22 February].
[207] H. Schulzrinne and S. Petrack, “RTP payload for DTMF digits, telephony tones and
telephony signals,” RFC 2833, IETF, 2000.
[208] T. Nguyen and G. Armitage, “Clustering to assist supervised machine learning for real-
time IP traffic classification,” in IEEE International Conference on Communications
(ICC ’08), 2008, Beijing, China, 2008, pp. 5857–5862.
List of Figures
2.1 A typical DOCSIS cable network from ISP to home users . . . . . . . . . . . 31
2.2 Evaluation Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
3.1 An example dataset as a matrix of instances versus features . . . . . . . . . . . 45
3.2 An illustration of a full-flow. The forward direction is normally defined as
the client-to-server direction . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
3.3 An illustration of the definition of flow direction and features calculation . . . . 58
3.4 A simple scenario of online game traffic classification . . . . . . . . . . . . . . 59
3.5 Training and classification for a two-class supervised ML traffic classifier . . 60
3.6 Example of an automated QoS and priority control . . . . . . . . . . . . . . . 64
3.7 Example operation of an IP flow classifier . . . . . . . . . . . . . . . . . . . 65
5.1 An illustration of sub-flow definition . . . . . . . . . . . . . . . . . . . . . . . 93
5.2 Packet length from client to server for ET traffic - N = 25 packets . . . . . . . . 96
5.3 Packet length from server to client for ET traffic - N = 25 packets . . . . . . . . 96
5.4 Mean packet length from client to server for ET traffic - N = 25 packets . . . . 97
5.5 Standard deviation of packet length from client to server for ET traffic - N = 25
packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.6 Mean packet length in the client-to-server direction, calculated for the window
of the first N packets taken from 1,000 flow samples for ET traffic (1,000 values
of the means for each N value) . . . . . . . . . . . . . . . . . . . . . . . . . . 98
5.7 The standard deviation of packet length in the client-to-server direction, calcu-
lated for window of the first N packets taken from 1,000 flow samples for ET
traffic (1,000 values of the standard deviations for each N value) . . . . . . . . 99
5.8 High-level description of datasets used for training and testing . . . . . . . . . 101
5.9 Distribution of different applications’ traffic (in flows and percentage) in the
training datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
5.10 Distribution of different applications’ traffic (in flows and percentage) in testing
datasets for N = 25 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107
5.11 ET Recall: Classifier trained with full-flows, tested with four different sliding
windows - Naive Bayes models . . . . . . . . . . . . . . . . . . . . . . . . . . 109
5.12 ET Precision: Classifier trained with full-flows, tested with four different slid-
ing windows - Naive Bayes models . . . . . . . . . . . . . . . . . . . . . . . . 110
5.13 ET Recall: Classifier trained with full-flows, tested with four different sliding
windows - C4.5 Decision Tree models . . . . . . . . . . . . . . . . . . . . . . 111
5.14 ET Precision: Classifier trained with full-flows, tested with four different slid-
ing windows - C4.5 Decision Tree models . . . . . . . . . . . . . . . . . . . . 111
5.15 ET Recall and Precision: Classifier trained on filtered full-flows, N = 25 for
classification - Naive Bayes models . . . . . . . . . . . . . . . . . . . . . . . 112
5.16 An illustration of creating classification rules for the full-flow and filtered full-
flow models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
5.17 ET Recall and Precision: Classifier trained on filtered full-flows, N = 25 for
classification - C4.5 Decision Tree models . . . . . . . . . . . . . . . . . . . . 115
5.18 ET Recall: Classifier trained on 25-packet sub-flows, N = 25 for classification -
Naive Bayes models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
5.19 ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classifica-
tion - Naive Bayes models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
5.20 ET Recall: Classifier trained on 25-packet sub-flows, N = 25 for classification -
C4.5 Decision Tree models . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
5.21 ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classifica-
tion - C4.5 Decision Tree models . . . . . . . . . . . . . . . . . . . . . . . . . 118
5.22 ET Precision: Classifier trained on 25-packet sub-flows, N = 25 for classifica-
tion - C4.5 Decision Tree models - a zoomed-in version of Figure 5.21 . . . . . 119
5.23 ET Recall: Comparing full-flow and sub-flow training of the Naive Bayes clas-
sifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119
5.24 ET Precision: Comparing full-flow and sub-flow training of the Naive Bayes
classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
5.25 An illustration of creating multiple sub-flows classifier from a number of in-
dividual sub-flows (data points are artificially created for illustrative purposes
only). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
5.26 ET Recall: Comparing full-flow and sub-flow training of the C4.5 Decision
Tree classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
5.27 ET Precision: Comparing full-flow and sub-flow training of the C4.5 Decision
Tree classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
6.1 An illustration of the sub-flow identification step . . . . . . . . . . . . . . . . 131
6.2 An illustration of selecting representative sub-flows to train a classifier . . . . . 132
6.3 Step 1 - Experimental approach . . . . . . . . . . . . . . . . . . . . . . . . . . 134
6.4 Number of instances for each sub-flow identified in Step 1 . . . . . . . . . . . 135
6.5 Sub-flow to cluster mapping and evaluation. . . . . . . . . . . . . . . . . . . . 136
6.6 Normalised number of instances in training each classifier . . . . . . . . . . . 139
6.7 Recall for Naive Bayes classifiers trained on various selections of full-flows and
sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.8 Recall for Naive Bayes classifiers using multiple sub-flows, expanded from Fig-
ure 6.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
6.9 Precision for Naive Bayes classifiers trained on various selections of full-flows
and sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
6.10 Precision for Naive Bayes classifiers using multiple sub-flows, expanded from
Figure 6.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
6.11 Recall for C4.5 Decision Tree classifiers trained on various selections of full-
flows and sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
6.12 Precision for C4.5 Decision Tree classifiers trained on various selections of full-
flows and sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
6.13 Normalised build time for Naive Bayes classifiers . . . . . . . . . . . . . . . . 145
6.14 Normalised classification speed for Naive Bayes classifiers . . . . . . . . . . . 146
6.15 Normalised memory usage for Naive Bayes classifiers while performing 10-
times cross validation (during both training and testing) . . . . . . . . . . . . . 146
6.16 Normalised build time for C4.5 Decision Tree classifiers . . . . . . . . . . . . 147
6.17 Normalised classification speed for C4.5 Decision Tree classifiers . . . . . . . 148
6.18 Normalised memory usage for C4.5 Decision Tree classifiers while performing
10-times cross-validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
6.19 Summary of Precision / Recall results for Naive Bayes (NB) and C4.5 Decision
Tree (DT) classifiers trained on multiple sub-flows . . . . . . . . . . . . . . . 149
6.20 Summary of computational performance results for Naive Bayes (NB) and C4.5
Decision Tree (DT) classifiers trained on multiple sub-flows . . . . . . . . . . 150
6.21 Sampled clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
6.22 Precision and Recall for Naive Bayes classifiers using sub-flows selected by EM
with small numbers of samples for the clustering process. . . . . . . . . . . . . 155
6.23 Results for C4.5 Decision Tree classifiers using sub-flows selected by EM with
small numbers of samples for the clustering process. . . . . . . . . . . . . . . 156
6.24 Normalised Model Build Time for classifiers trained on sub-flows selected by
EM with small numbers of samples used in the clustering process . . . . . . . 157
6.25 Normalised classification speed for classifiers trained on sub-flows selected by
EM with small numbers of samples used in the clustering process . . . . . . . 157
6.26 An illustration of updating a classifier when new, previously unknown traffic is
detected . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
7.1 Steps in training an ML classifier for identification of ET traffic versus Other
traffic - without using the SSP approach . . . . . . . . . . . . . . . . . . . . . 164
7.2 An illustration of how to create a mirror-image replica for a sub-flow instance . 164
7.3 Option 1: Both sub-flows’ instances and the mirror-image replicas of every
short sub-flow are labelled as one class. The classifier is trained with two
classes: ET and Other. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
7.4 Option 2: Sub-flows’ instances and their mirror-image replicas are labelled in-
dependently as two separate classes. The classifier is trained with four classes:
ET, ET’, Other and Other’. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
7.5 Example datasets used to train a classifier using Option 1 and Option 2 . . . . . 167
7.6 An illustration of creating SSP classifier from sub-flow instances and their
mirror-image replicas (data points are artificially created for illustration purposes
only.) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
7.7 Percentage of flows that have the first packet captured in the client-to-server
direction if the first M packets are missed . . . . . . . . . . . . . . . . . . . . 169
7.8 Recall for Naive Bayes classifiers trained on full-flow (full-flow model), filtered
full-flow (filtered full-flow model) and multiple sub-flows (multiple sub-flows
model) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 171
7.9 Precision for Naive Bayes classifiers trained on full-flow, filtered full-flow and
multiple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
7.10 Recall for C4.5 Decision Tree classifiers trained on full-flow, filtered full-flow
and multiple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.11 Precision for C4.5 Decision Tree classifiers trained on full-flow, filtered full-
flow and multiple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
7.12 Recall for Naive Bayes classifiers trained using SSP Option 1 and multiple sub-
flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
7.13 Precision for Naive Bayes classifiers trained using SSP Option 1 and multiple
sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.14 Recall for C4.5 Decision Tree classifiers trained using SSP Option 1 and multi-
ple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
7.15 Precision for C4.5 Decision Tree classifiers trained using SSP Option 1 and
multiple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
7.16 Recall for Naive Bayes classifiers trained using SSP Option 1, SSP Option 2
and Multiple Sub-Flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
7.17 Precision for Naive Bayes classifiers trained using SSP Option 1, SSP Option 2
and multiple sub-flows . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177
7.18 Recall for C4.5 Decision Tree classifiers trained using SSP Option 1, SSP Op-
tion 2 and Multiple Sub-Flows . . . . . . . . . . . . . . . . . . . . . . . . . . 178
7.19 Precision for C4.5 Decision Tree classifiers trained using SSP Option 1, SSP
Option 2 and Multiple Sub-Flows . . . . . . . . . . . . . . . . . . . . . . . . 179
7.20 Computational performance for Naive Bayes and C4.5 Decision Tree classifiers
trained on multiple sub-flows, SSP Option 1 and SSP Option 2 . . . . . . . . . 180
8.1 Cumulative distribution of call duration . . . . . . . . . . . . . . . . . . . . 186
8.2 G.711 traffic - forward direction, mean packet length calculated over a window
of 25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.3 G.711 traffic - forward direction, mean packet inter-arrival time calculated over
a window of 25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
8.4 GSM traffic - forward direction, mean packet length calculated over a window
of 25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
8.5 GSM traffic - forward direction, mean packet inter-arrival time calculated over
a window of 25 packets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
8.6 Voice traffic generated during a voice conversation: Comfort noise packets and
silence suppression periods during a conversation can create asymmetry and
multiple packet sizes within the traffic captured by the sliding window . . . . . 190
8.7 VoIP Recall: Naive Bayes classifiers trained on full-flow and SSP-ACT . . . . 191
8.8 VoIP Precision: Naive Bayes classifiers trained on full-flow and SSP-ACT . . . 192
8.9 VoIP Recall: C4.5 Decision Tree classifiers trained on full-flow and SSP-ACT . 192
8.10 VoIP Precision: C4.5 Decision Tree classifiers trained on full-flow and SSP-ACT 193
8.11 VoIP classification using classifiers trained on full-flow and SSP-ACT: Training
on full-flow may cover a larger area of VoIP instances when classifying using
a small sliding window, hence resulting in higher Recall but lower Precision
compared to training using SSP-ACT. (The data points are artificially created
for illustration purposes only. They are not actual data points from my dataset.) 195
8.12 A simple illustration of the impact of packet loss on packet inter-arrival time
statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 196
8.13 ET Recall: Training with SSP-ACT and classifying with ET traffic experiencing
5% random packet loss - Naive Bayes classifier . . . . . . . . . . . . . . . . . 199
8.14 ET Precision: Training with SSP-ACT and classifying with ET traffic experi-
encing 5% random packet loss - Naive Bayes classifier . . . . . . . . . . . . . 199
8.15 ET Recall: Training with SSP-ACT and classifying with ET traffic experiencing
5% random packet loss - C4.5 Decision Tree classifier . . . . . . . . . . . . . 200
8.16 ET Precision: Training with SSP-ACT and classifying with ET traffic experi-
encing 5% random packet loss - C4.5 Decision Tree classifier . . . . . . . . . . 201
8.17 VoIP Recall: Training with SSP-ACT and classifying with VoIP traffic experi-
encing 5% random packet loss - Naive Bayes classifier . . . . . . . . . . . . . 201
8.18 VoIP Precision: Training with SSP-ACT and classifying with VoIP traffic expe-
riencing 5% random packet loss - Naive Bayes classifier . . . . . . . . . . . . 202
8.19 VoIP Recall: Training with SSP-ACT and classifying with VoIP traffic expe-
riencing 5% random packet loss - C4.5 Decision Tree classifier . . . . . . . . . 203
8.20 VoIP Precision: Training with SSP-ACT and classifying with VoIP traffic expe-
riencing 5% random packet loss - C4.5 Decision Tree classifier . . . . . . . . 203
8.21 Training for VoIP and ET traffic identification: Option A - Common classifier . 206
8.22 Training for VoIP and ET traffic identification: Option B - Separate classifiers
in parallel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
8.23 VoIP Recall and Precision: Naive Bayes classifier . . . . . . . . . . . . . . . . 208
8.24 VoIP Recall and Precision: C4.5 Decision Tree classifier . . . . . . . . . . . . 208
8.25 Computational performance for Naive Bayes and C4.5 Decision Tree classifiers
trained with Option A and Option B . . . . . . . . . . . . . . . . . . . . . . . 209
A.1 Client port range for the other selected applications . . . . . . . . . . . . . . . 252
A.2 ET packet length in C-S and S-C directions . . . . . . . . . . . . . . . . . . . 253
A.3 ET packet inter-arrival time in C-S and S-C directions . . . . . . . . . . . . . . 254
A.4 HTTP packet length in C-S and S-C directions . . . . . . . . . . . . . . . . . . 255
A.5 SMTP packet length in C-S and S-C directions . . . . . . . . . . . . . . . . . 255
A.6 P2P packet length in C-S and S-C directions . . . . . . . . . . . . . . . . . . . 256
A.7 Packet length statistics calculated over five consecutive packets at different
phases during a flow’s lifetime for SMTP traffic - C-S direction . . . . . . . . . 256
A.8 Packet length statistics calculated over five consecutive packets at different
phases during a flow’s lifetime for Kazaa traffic - C-S direction . . . . . . . . . 257
A.9 Packet length statistics calculated over five consecutive packets at different
phases during a flow’s lifetime for HTTP traffic - S-C direction . . . . . . . . . 257
C.1 Top 10 countries that contributed the greatest amount of ET traffic in the training
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
C.2 Top 10 Countries that contributed the most amount of ET traffic in the testing
dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266
C.3 Cumulative distribution of client hop counts per country for the training dataset 267
C.4 Distribution of different applications’ traffic (in flows and percentage) in testing
datasets for N = 10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
C.5 Distribution of different applications’ traffic (in flows and percentage) in testing
datasets for N = 100 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 269
C.6 Distribution of different applications’ traffic (in flows and percentage) in testing
datasets for N = 1000 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
D.1 G.711 traffic - mean packet length - reverse direction . . . . . . . . . . . . . . 272
D.2 G.711 traffic - mean packet inter-arrival time - reverse direction . . . . . . . . . 273
D.3 GSM traffic - mean packet length - reverse direction . . . . . . . . . . . . . . . 273
D.4 GSM traffic - mean packet inter-arrival time - reverse direction . . . . . . . . . 273
E.1 Recall for different classifiers trained using different number of clusters . . . . 276
E.2 Recall for different classifiers trained using different number of clusters . . . . 277
E.3 Normalised build time for different classifiers trained using different number of
clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
E.4 Normalised classification speed for different classifiers trained using different
number of clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
E.5 Normalised memory usage for different classifiers trained using different num-
ber of clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279
E.6 Normalised clustering time for different classifiers trained using different num-
ber of clusters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
List of Tables
5.1 Two-sample KS test p-values (probability of occurrence of the null hypothesis)
for the mean packet length feature sets calculated for different N values, based
on a set of 1000 flow samples . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
5.2 ET traffic full-flow dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
5.3 Sampled interfering application flows - full-flow dataset . . . . . . . . . . . . . 102
5.4 Detailed training and testing implementation for each experiment . . . . . . . . 105
5.5 Detailed training and testing implementation for each experiment (continued) . 106
6.1 The differences in training instances for each classifier . . . . . . . . . . . . . 138
6.2 Number of sub-flows selected automatically by the clustering process . . . . . 154
8.1 Comparison of the pros and cons of Option A: Common classifier versus Option
B: Separate classifiers in parallel . . . . . . . . . . . . . . . . . . . . . . . . . 211
B.1 A Summary of Research Reviewed in Chapter 4 . . . . . . . . . . . . . . . . . 260
B.2 A Summary of Research Reviewed in Chapter 4 (continued) . . . . . . . . . . 261
B.3 A Summary of Research Reviewed in Chapter 4 (continued) . . . . . . . . . . 262
B.4 A Summary of Research Reviewed in Chapter 4 (continued) . . . . . . . . . . 263
B.5 Reviewed work in light of considerations for operational traffic classification . 264
Appendix A
Traffic Characteristics of Selected Internet Applications
In this appendix I look at some characteristics of an FPS game, Wolfenstein Enemy Territory
(ET), and three other common Internet applications: SMTP, Kazaa, and HTTP. These applica-
tions are among the sample applications used to train my classifiers in Chapters 5 to 8. I place
emphasis on their asymmetry in the client-to-server (C-S) and server-to-client (S-C) directions,
and in the variation of their traffic statistics over a flow’s lifetime.
A.1 Asymmetric properties in bi-directional communication
The asymmetry in bi-directional traffic is seen both in UDP/TCP ports used at the client and
server and in the statistical properties in client-to-server (C-S) and server-to-client (S-C) direc-
tions for the applications considered.
A.1.1 Client server ports asymmetry
I sample up to 100 flows of each application for this analysis. ET traffic is sampled from a full-
month data trace collected at a public server [179] in Australia during September 2005. Only
flows with more than 1,000 packets in the C-S direction are selected, to ensure that they
are actual game flows. Traffic for SMTP, Kazaa and HTTP is sampled from one 24-hour trace
collected by the University of Twente, the Netherlands, on the 6th of February 2004 [181].
In a typical TCP connection the traffic flow starts with a three-way (SYN, SYN/ACK, ACK)
handshake initiated by the client. The server port to which the client addresses its initial SYN
packet is usually well-known¹. The client port, on the other hand, is typically chosen
dynamically. It follows that if
the classifier misses the first SYN packet (from C-S) there will be a chance that the first packet
that it captures will be from the reverse direction (S-C). For applications with asymmetric client
and server ports, the performance of a port-based classification approach would then degrade
(i.e. it would produce a higher rate of false negatives).
Figure A.1 shows the client port numbers for ET, HTTP, Kazaa and SMTP traffic. These
applications show a wide distribution of client port numbers, spread across the whole
range of possible port numbers (1 through 65,535). For my sample ET traffic, the server port is
configured as 27961 and client ports are distributed across a wide range. Approximately 50%
of the flow samples have client ports equal to the default ET port of 27960. Less than 1% of the
flow samples have client ports of 27961 (the port actually used by the ET server in this case).
For HTTP, Kazaa and SMTP traffic, the client ports are widely distributed, and different from
the server ports of 80, 1214 and 25 for each application respectively.
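The client-port analysis behind Figure A.1 can be reproduced with a short sketch like the following. The flow records, field names and sample values here are hypothetical stand-ins for illustration, not output from the tools used in this thesis:

```python
# Sketch: empirical CDF of client port numbers per application.
# Flow records and values below are invented for illustration only.
from collections import namedtuple

Flow = namedtuple("Flow", ["client_port", "server_port", "app"])

def client_port_cdf(flows):
    """Return sorted (port, cumulative_fraction) pairs for plotting a CDF."""
    ports = sorted(f.client_port for f in flows)
    n = len(ports)
    return [(port, (i + 1) / n) for i, port in enumerate(ports)]

# Toy sample: ET clients often reuse the default ET port 27960,
# while TCP clients typically pick ephemeral ports at random.
sample = [Flow(27960, 27961, "ET"), Flow(27960, 27961, "ET"),
          Flow(51234, 80, "HTTP"), Flow(60001, 80, "HTTP")]
cdf = client_port_cdf(sample)
```

A flat CDF with a large jump at one port (as for ET at 27960) indicates a concentrated client-port choice, while a near-diagonal CDF indicates ports spread across the ephemeral range.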
[CDF plot: client port number (0-65,535) versus cumulative distribution (0-1), for ET, HTTP, SMTP and Kazaa]
Figure A.1: Client port range for the other selected applications
A.1.2 Statistical Properties Asymmetry
All applications considered exhibit asymmetry in their statistical properties in the C-S and S-C
directions. Figures A.2 and A.3 illustrate this aspect of ET traffic in terms of packet length
and packet inter-arrival time². Packet length in the C-S direction (normally carrying the client’s
queries and updates) is typically small, mostly ranging between 62 and 75 bytes in my dataset.
Packet length in the S-C direction (normally carrying the server’s response information) is
more varied, mostly ranging between 77 and 276 bytes in my dataset. The S-C packet length
is influenced by the combination of the map and the number of players participating in a particular
game, while the C-S packet length is driven mostly by the behaviour of a particular player (for ex-
ample, C-S packets are shorter when a player is connected but idle, and longer when the player
starts playing) [201].

¹If the application uses a registered port [100] or a port number selected from a range of default values.
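The per-direction contrast can be illustrated with a minimal sketch. The packet lengths below are invented toy values chosen to echo the reported ranges, not measurements from my trace:

```python
# Sketch: summarise packet lengths per direction to expose C-S/S-C asymmetry.
# All values below are invented for illustration only.
def length_summary(pkt_lengths):
    """Min/mean/max (bytes) of a list of packet lengths."""
    return {"min": min(pkt_lengths),
            "mean": sum(pkt_lengths) / len(pkt_lengths),
            "max": max(pkt_lengths)}

cs = [62, 66, 70, 75]        # client-to-server: small, narrow range
sc = [77, 120, 200, 276]     # server-to-client: larger and more varied

# Positive value indicates S-C packets are larger on average.
asymmetry = length_summary(sc)["mean"] - length_summary(cs)["mean"]
```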
[CDF plot: packet length (bytes) versus cumulative distribution (0-1), C-S and S-C directions]
Figure A.2: ET packet length in C-S and S-C directions
The ET packet rate in the S-C direction depends on the server’s update algorithm. Figure
A.3 shows a fairly consistent packet inter-arrival time of approximately 0.05 seconds. The
Cumulative Distribution Function (CDF) of S-C packet inter-arrival times jumps slightly at 0.1
or 0.15 seconds. This can be due to a lost packet, or to a packet skipped when the ET server
briefly rate-limits its transmissions to particular clients; the packet inter-arrival times increase
at multiples of 50 ms in these cases. In contrast, from C-S there is a wider range of packet inter-arrival time
values, which can be due to the diversity in graphic cards and maps used by particular clients
[201] and/or a choice of the client’s software to lower its sending rate due to slower speeds at
the client’s access links [176]³.

²Data is collected from all packets of the full-flow samples.
³The packet rates in the C-S direction seen in my analysis are slightly lower than those reported in [201],
because [201] analysed LAN players with high-speed links to a local server. My data trace consists of widely
dispersed, geographically distributed players connecting to the server via the Internet. Players can configure
their clients to work within lower Internet access rate limits, which leads to longer average and peak packet
inter-arrival times (i.e. lower packet rates).

[CDF plot: packet IAT (seconds) versus cumulative distribution (0-1), C-S and S-C directions]
Figure A.3: ET packet inter-arrival time in C-S and S-C directions

For Web, Kazaa and SMTP traffic the asymmetry in packet inter-arrival times is minimal,
because with TCP as the transport protocol the sending rate is regulated by the receiver’s flow
control mechanism. However, packet length asymmetry in the C-S and S-C directions is significant,
as shown in Figures A.4, A.5 and A.6. Packet lengths in one direction are typically smaller than
those in the reverse direction due to the typical asymmetry in the application traffic (e.g. small
request packets in one direction versus long response packets in the reverse direction).

A.2 Variation of traffic statistics during flow lifetime

The variation of ET traffic statistics during a flow lifetime was presented in section 5.3.3. While
not as significant as those of ET traffic, Kazaa and SMTP flow statistics also change during the
lifetime of a flow. For example, the initial handshake of a new SMTP connection looks quite
different to the traffic while transferring the body of each email.

Figure A.7 presents the mean packet length of five consecutive packets in the C-S direction⁴,
taken at different points in time of the SMTP flows. I consider two different phases of the traffic

⁴The three-way handshake at the beginning of SMTP traffic typically occurs within the first six packets exchanged between a client and a server.
[CDF plot: packet length (bytes) versus cumulative distribution (0-1), C-S and S-C directions]
Figure A.4: HTTP packet length in C-S and S-C directions
[CDF plot: packet length (bytes) versus cumulative distribution (0-1), C-S and S-C directions]
Figure A.5: SMTP packet length in C-S and S-C directions
[CDF plot: packet length (bytes) versus cumulative distribution (0-1), C-S and S-C directions]
Figure A.6: P2P packet length in C-S and S-C directions
flow during its lifetime: Starting (the beginning of the traffic flow) and In progress (the five
consecutive packets starting from the 10th packet). As shown in Figure A.7, the statistical
properties computed over five packets taken at different phases are different from each other,
and different from those calculated over a full-flow. Similar characteristics are seen with Kazaa
and HTTP traffic, as shown in Figures A.8 and A.9.
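The phase comparison in these figures can be sketched as follows, using a window of five packets with the "In progress" window starting at the 10th packet, as in the text. The example flow is invented for illustration, not taken from my dataset:

```python
# Sketch: mean packet length over a five-packet window at different
# phases of a flow's lifetime. The example flow is invented.
def window_mean(lengths, start, size=5):
    """Mean packet length over `size` consecutive packets from index `start`."""
    window = lengths[start:start + size]
    return sum(window) / len(window)

def phase_features(lengths):
    return {
        "starting":    window_mean(lengths, 0),   # first five packets
        "in_progress": window_mean(lengths, 9),   # from the 10th packet
        "full_flow":   sum(lengths) / len(lengths),
    }

# Invented SMTP-like C-S flow: short handshake/command packets first,
# then larger packets while a message body is transferred.
flow = [60, 60, 66, 90, 120, 300, 800, 1400, 1400,
        1400, 1400, 1400, 900, 60]
feats = phase_features(flow)
```

In such a flow the "Starting" window mean differs markedly from both the "In progress" window mean and the full-flow mean, which is the effect the figures illustrate.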
[CDF plot: mean packet length (C-S, bytes) versus cumulative distribution (0-1); curves: Starting, In progress, Full flow]
Figure A.7: Packet length statistics calculated over five consecutive packets at different phases during a flow's lifetime for SMTP traffic - C-S direction
[CDF plot: mean packet length (C-S, bytes) versus cumulative distribution (0-1); curves: Starting, In progress, Full flow]
Figure A.8: Packet length statistics calculated over five consecutive packets at different phases during a flow's lifetime for Kazaa traffic - C-S direction
[CDF plot: mean packet length (S-C, bytes) versus cumulative distribution (0-1); curves: Starting, In progress, Full flow]
Figure A.9: Packet length statistics calculated over five consecutive packets at different phases during a flow's lifetime for HTTP traffic - S-C direction
Appendix B
A Summary of ML-Based IP Traffic Classification Works in the Literature
B.1 A summary of key points for each reviewed work
Some key points for each work reviewed in Chapter 4 are summarised in Tables B.1, B.2, B.3,
and B.4.
B.2 A qualitative evaluation of the reviewed works
Table B.5 provides a qualitative summary of the reviewed works in Chapter 4 against the fol-
lowing criteria:
• Real-Time Classification
– No: The work makes use of features that require flow completion to compute (e.g.
flow duration, total flow bytes count)
– Yes: The work requires the capture of a small number of packets/bytes of a flow to
complete the classification
• Feature Computation Overhead
– Low: The work makes use of a small number of features (e.g. sizes of the first few
packets, binary encoding of the first few bytes of a uni-directional flow)
– Average: The work makes use of an average set of features (such as packet length
and inter-arrival times statistics, flow duration, bytes count)
– High: The work makes use of a large (compared with other work in the area) set of
computationally complex features (such as the Fourier transform of packet inter-arrival
times)
• Continuous Classification
– Not addressed: The issue is not considered in the work
– Yes: The issue is considered and solved in the work
• Directional Neutrality
– No: The work makes use of bi-directional flow and feature calculations, but does
not consider the issue
– Yes: The work makes use of bi-directional flow and feature calculations, addresses
the issue and proposes a solution
– N/A: The work makes use of uni-directional flow and the issue is not applicable
– Not clear: Not clearly stated in the paper
Table B.1: A Summary of Research Reviewed in Chapter 4

McGregor et al. [59]
  ML algorithms: Expectation Maximisation
  Features: Packet length statistics (min, max, quartiles, ...); inter-arrival statistics; byte counts; connection duration; number of transitions between transaction mode and bulk transfer mode; idle time. Calculated on full flows.
  Data traces: NLANR and Waikato trace
  Traffic considered: A mixture of HTTP, SMTP, FTP (control), NTP, IMAP, DNS, ...
  Classification level: Coarse grained (bulk transfer, small transactions, multiple transactions, ...)

Zander et al. [60]
  ML algorithms: AutoClass
  Features: Packet length statistics (mean and variance in forward and backward directions); inter-arrival time statistics (mean and variance in forward and backward directions); flow size (bytes); flow duration. Calculated on full flows.
  Data traces: Auckland-VI, NZIX-II and Leipzig-II from NLANR
  Traffic considered: Half-Life, Napster, AOL, HTTP, DNS, SMTP, Telnet, FTP (data)
  Classification level: Fine grained (8 applications studied)

Roughan et al. [61]
  ML algorithms: Nearest Neighbour, Linear Discriminant Analysis and Quadratic Discriminant Analysis
  Features: Packet level; flow level; connection level; intra-flow/connection features; multi-flow features. Calculated on full flows.
  Data traces: Waikato trace and session logs from a commercial streaming service
  Traffic considered: Telnet, FTP (data), Kazaa, Real Media Streaming, DNS, HTTPS
  Classification level: Fine grained (three, four and seven classes of individual applications)

Moore and Zuev [98]
  ML algorithms: Bayesian techniques (Naive Bayes, and Naive Bayes with Kernel Estimation and the Fast Correlation-Based Filter method)
  Features: Total of 248 features, among them (detailed in [150]): flow duration; TCP port; packet inter-arrival time statistics; payload size statistics; effective bandwidth based upon entropy; Fourier transform of packet inter-arrival time. Calculated on full flows.
  Data traces: Proprietary hand-classified traces
  Traffic considered: A large range of Database, P2P, Bulk, Mail, Services, ... traffic
  Classification level: Coarse grained
Table B.2: A Summary of Research Reviewed in Chapter 4 (continued)

Bernaille et al. [122]
  ML algorithms: Simple K-Means
  Features: Packet lengths of the first few packets of bi-directional traffic flows
  Data traces: Proprietary traces
  Traffic considered: eDonkey, FTP, HTTP, Kazaa, NTP, POP3, SMTP, SSH, HTTPS, POP3S
  Classification level: Fine grained (10 applications studied)

Park et al. [152]
  ML algorithms: Naive Bayes with Kernel Estimation, Decision Tree J48 and Reduced Error Pruning Tree
  Features: Flow duration; initial advertised window bytes; number of actual data packets; number of packets with the PUSH option; packet lengths; advertised window bytes; packet inter-arrival time; size of total burst packets
  Data traces: NLANR, USC/ISI, CAIDA
  Traffic considered: WWW, Telnet, Chat (Messenger), FTP, P2P (Kazaa, Gnutella), Multimedia, SMTP, POP, IMAP, DNS, Oracle, X11
  Classification level: N/A (comparison work)

Erman et al. [123]
  ML algorithms: K-Means
  Features: Total number of packets; mean packet length; mean payload length excluding headers; number of bytes transferred; flow duration; mean inter-arrival time
  Data traces: Eight self-collected 1-hour campus traces, April 6-9, 2006
  Traffic considered: Web, P2P, FTP, Others
  Classification level: Coarse grained (29 different protocols grouped into a number of application categories)

Crotti et al. [167]
  ML algorithms: Protocol fingerprints (Probability Density Function vectors) and an anomaly score (from protocol PDFs to protocol fingerprints)
  Features: Packet lengths; inter-arrival time; packet arrival order
  Data traces: 6-month self-collected traces at the edge gateway of the University of Brescia data centre network
  Traffic considered: TCP applications (HTTP, SMTP, POP3, SSH)
  Classification level: Fine grained (four TCP protocols)
Table B.3: A Summary of Research Reviewed in Chapter 4 (continued)

Ma et al. [202]
  ML algorithms: Unsupervised learning (product distribution, Markov processes, and common substring graphs)
  Features: Discrete byte encoding of the first n bytes of payload of a TCP unidirectional flow
  Data traces: Proprietary
  Traffic considered: FTP (control), SMTP, POP3, IMAP, HTTPS, HTTP, SSH
  Classification level: Fine grained

Auld et al. [165]
  ML algorithms: Bayesian Neural Network
  Features: 246 features in total, including: flow metrics (duration, packet count, total bytes); packet inter-arrival time statistics; size of TCP/IP control fields; total packets in each direction and total for the bi-directional flow; payload size; effective bandwidth based upon entropy; top-ten Fourier transform components of packet inter-arrival times for each direction
  Data traces: Proprietary hand-classified traces
  Traffic considered: A large range of Database, P2P, Bulk, Mail, Services, Multimedia, Web ... traffic
  Classification level: Coarse grained

Williams et al. [171]
  ML algorithms: Naive Bayes with Discretisation, Naive Bayes with Kernel Estimation, C4.5 Decision Tree, Bayesian Network and Naive Bayes Tree
  Features: Protocol; flow duration; flow volume in bytes and packets; packet length (minimum, mean, maximum and standard deviation); inter-arrival time between packets (minimum, mean, maximum and standard deviation)
  Data traces: NLANR
  Traffic considered: FTP (data), Telnet, SMTP, DNS, HTTP
  Classification level: N/A (comparison work)

Haffner et al. [172]
  ML algorithms: Naive Bayes, AdaBoost, Regularized Maximum Entropy
  Features: Discrete byte encoding of the first n bytes of payload of a TCP unidirectional flow
  Data traces: Proprietary
  Traffic considered: FTP (control), SMTP, POP3, IMAP, HTTPS, HTTP, SSH
  Classification level: Fine grained
Table B.4: A Summary of Research Reviewed in Chapter 4 (continued)

Erman et al. [153]
  ML algorithms: K-Means, DBSCAN and AutoClass
  Features: Total number of packets; mean packet length; mean payload length excluding headers; number of bytes transferred (in each direction and combined); mean packet inter-arrival time
  Data traces: NLANR and a self-collected 1-hour trace from the University of Calgary
  Traffic considered: HTTP, P2P, SMTP, IMAP, POP3, MSSQL, Other
  Classification level: N/A (comparison work)

Erman et al. [170]
  ML algorithms: Naive Bayes and AutoClass
  Features: Total number of packets; mean packet length (in each direction and combined); flow duration; mean data packet length; mean packet inter-arrival time
  Data traces: NLANR
  Traffic considered: HTTP, SMTP, DNS, SOCKS, FTP (control), FTP (data), POP3, Limewire
  Classification level: N/A (comparison work)

Bonfiglio et al. [54]
  ML algorithms: Naive Bayes and Pearson's Chi-Square test
  Features: Message size (the length of the message encapsulated in the transport layer protocol segment); average inter-packet gap
  Data traces: Two self-collected datasets
  Traffic considered: Skype traffic
  Classification level: Application specific
Table B.5: Reviewed work in light of considerations for operational traffic classification

Work                   | Real-time classification | Feature computation overhead | Classify flows in progress | Directional neutrality
McGregor et al. [59]   | No  | Average | Not addressed | No
Zander et al. [60]     | No  | Average | Not addressed | No
Roughan et al. [61]    | No  | Average | Not addressed | N/A
Moore and Zuev [98]    | No  | High    | Not addressed | No
Bernaille et al. [122] | Yes | Low     | Not addressed | No
Park et al. [152]      | No  | Average | Not addressed | Not clear
Erman et al. [123]     | No  | Average | Not addressed | No
Crotti et al. [167]    | Yes | Average | Not addressed | No
Haffner et al. [172]   | Yes | Average | Not addressed | N/A
Ma et al. [202]        | No  | Average | Not addressed | No
Auld et al. [165]      | No  | High    | Not addressed | No
Williams et al. [171]  | N/A | Average | N/A           | N/A
Erman et al. [153]     | N/A | Average | N/A           | N/A
Erman et al. [170]     | N/A | Average | N/A           | N/A
Bonfiglio et al. [54]  | Yes | Average | Not addressed | Not clear
Appendix C
Some Properties of Data Used for Training and Testing
This appendix presents some properties of the training and testing dataset used in Chapters 5,
6 and 7.
C.1 Geographical distribution of ET traffic
Figures C.1 and C.2 show the top 10 countries that contributed the greatest amount of ET
traffic (in terms of total bytes and number of flows) in the May 2005 (training) and September
2005 (testing) datasets.
[Figure: bar charts of the top 10 contributing countries. Panel (a), percentage of total flows (0 to 20%), in order: United States, Australia, Poland, Germany, France, Finland, Netherlands, United Kingdom, Canada, Belgium. Panel (b), percentage of total bytes (0 to 80%), in order: Australia, Poland, France, Germany, United States, United Kingdom, New Zealand, Finland, Sweden, Netherlands.]

Figure C.1: Top 10 countries that contributed the greatest amount of ET traffic in the training dataset
Figures C.1(b) and C.2(b) are more peaked than Figures C.1(a) and C.2(a), as most actual long game flows are from Australia (where the server is located).
[Figure: bar charts of the top 10 contributing countries. Panel (a), percentage of total flows (0 to 30%), in order: Australia, United States, Poland, Germany, France, Netherlands, Finland, United Kingdom, Belgium, Canada. Panel (b), percentage of total bytes (0 to 100%), in order: Australia, Poland, New Zealand, United States, Germany, France, Netherlands, Finland, United Kingdom, Italy.]

Figure C.2: Top 10 countries that contributed the greatest amount of ET traffic in the testing dataset
I estimate the distribution of hop counts for the client IP addresses from the TTL field values in each flow's packets. My estimation method is based on assumptions similar to those outlined in [203]: the initial TTL is usually a multiple of 32, and it is decremented once at each hop back towards the game server (the measurement point). Since the default TTL values configured at the client machines are unknown, I assume that the maximum hop count is no more than 32 hops. The hop count can therefore be inferred from the observed TTL value as:

HopCount = ceiling(TTL / 32) * 32 - TTL
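As a sketch, this inference amounts to a one-line function (the function name and example TTL values here are my own, for illustration only):

```python
import math

def hop_count(ttl: int) -> int:
    """Infer the hop count from an observed TTL, assuming the sender's
    initial TTL was the next multiple of 32 at or above the observed
    value, and that the path is no longer than 32 hops."""
    return math.ceil(ttl / 32) * 32 - ttl

# An observed TTL of 54 implies an initial TTL of 64, i.e. 10 hops:
print(hop_count(54))   # -> 10
print(hop_count(120))  # -> 8 (implied initial TTL of 128)
```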
The topological spread of international game clients is illustrated by the distributions of hop counts for game flows from the most popular countries to the server. As can be seen in Figure C.3, Australian clients are between 5 and 17 hops from the server, while international clients are at least 10 hops away. This finding is consistent with the study of [204]. Similar results are seen with the testing dataset.
C.2 Traffic mix for training and testing
This section provides details on the traffic mix used for training and testing the different classifiers in Chapter 5. N is the size of the sliding window; I consider N = 10, 100 and 1,000 packets. M is the number of packets offset from the beginning of each flow.
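As an illustration of this windowing, a sub-flow can be sketched as the N-packet slice taken M packets into a flow (the function and variable names here are my own, not taken from the classifier implementation):

```python
def sub_flow(packets, m, n):
    """Return the N-packet sliding window starting M packets into the flow."""
    return packets[m:m + n]

flow = list(range(50))          # a toy flow of 50 'packets'
window = sub_flow(flow, 10, 10)
print(window[0], window[-1])    # -> 10 19 (packets 10 through 19)
```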
[Figure: cumulative distribution functions of hop counts (0 to 30) for Australia, United States, Finland, Poland, Germany, France, Netherlands, United Kingdom, Canada and Brazil.]

Figure C.3: Cumulative distribution of client hop counts per country for the training dataset
[Figure: traffic per application category (Mail etc., HalfLife, DNS etc., ET, Web, P2P) against M (packets, 0 to 9K). Panel (a): number of flows (0 to 9,000). Panel (b): percentage of flows (0 to 50%).]

Figure C.4: Distribution of different applications' traffic (in flows and percentage) in testing datasets for N = 10
[Figure: traffic per application category (Mail etc., HalfLife, DNS etc., ET, Web, P2P) against M (packets, 0 to 9K). Panel (a): number of flows (0 to 2,500). Panel (b): percentage of flows (0 to 45%).]

Figure C.5: Distribution of different applications' traffic (in flows and percentage) in testing datasets for N = 100
[Figure: traffic per application category (Mail etc., HalfLife, DNS etc., ET, Web, P2P) against M (packets, 0 to 9K). Panel (a): number of flows (0 to 1,800). Panel (b): percentage of flows (0 to 70%).]

Figure C.6: Distribution of different applications' traffic (in flows and percentage) in testing datasets for N = 1,000
Appendix D
Characteristics of VoIP Traffic
D.1 VoIP data extraction
While my data trace is a mixture of voice traffic and other applications, voice traffic is filtered out as follows. Firstly, RTP traffic is extracted using TShark 1.0.4, a command-line-based version of Wireshark [205]. This tool dissects traffic protocols using deep packet inspection. It can identify a UDP datagram as containing a packet of a particular protocol running on top of UDP only if: the protocol has a particular standard port number and the UDP source or destination port number is that port; packets of that protocol can be identified by looking for a 'signature' of some type in the packet; or some other traffic earlier in the capture indicated that traffic between two particular addresses and ports belongs to the protocol [206].
When Tshark sees SIP/SDP traffic setting up an RTP session, the details of the RTP session (source IP address and source UDP port number) are identified from the SIP/SDP packets and used to extract the subsequent RTP stream.
While this approach is sound in most cases, registering only the source IP address and
source port number may result in false positives in subsequent classification of RTP traffic. As
the conversation registration is set up indefinitely, other traffic initiated from the same IP address
and port number will be falsely classified as RTP traffic.
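This false-positive mechanism can be sketched as follows (a toy model of the registration logic described above, not Tshark's actual implementation; the function names and addresses are my own):

```python
# Registration is keyed only on (source IP, source UDP port) and never
# expires, so any later flow reusing that pair is labelled RTP even if
# it actually carries another protocol (e.g. DNS).
registered = set()

def register_from_sdp(src_ip, src_port):
    """Record the RTP session details announced in a SIP/SDP packet."""
    registered.add((src_ip, src_port))

def classify(src_ip, src_port):
    return "RTP" if (src_ip, src_port) in registered else "other"

register_from_sdp("192.0.2.10", 16384)  # hypothetical SIP/SDP session setup
print(classify("192.0.2.10", 16384))    # -> RTP: the genuine voice stream
print(classify("192.0.2.10", 16384))    # -> RTP: a later DNS flow from the
                                        #    same address/port, falsely matched
```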
To eliminate false positives I then manually inspected all the flows believed by Tshark to be RTP traffic. Anomalous flows, that is, flows with an IP packet length different from 200 bytes (G.711 PCMU voice packets), 73 bytes (GSM voice packets), 41 bytes (comfort noise packets) or 44 bytes (Telephone-Event packets [207]), were filtered out for quarantine.
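The length-based quarantine test amounts to checking every packet in a candidate flow against the four expected sizes; a minimal sketch (the function name is my own):

```python
# Expected IP packet lengths in bytes: G.711 PCMU voice, GSM voice,
# comfort noise and Telephone-Event packets, respectively.
EXPECTED_LENGTHS = {200, 73, 41, 44}

def is_anomalous(packet_lengths):
    """Flag a candidate RTP flow containing any unexpected packet length."""
    return any(length not in EXPECTED_LENGTHS for length in packet_lengths)

print(is_anomalous([200, 41, 200, 44]))  # -> False: all expected sizes
print(is_anomalous([280, 280, 280]))     # -> True: quarantined for inspection
```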
Out of the 666 RTP flows identified by Tshark, I found one RTP flow containing 280-byte packets, sent at 30ms intervals in one direction. Closer inspection reveals that it contains G.711 PCMU packets, but instead of the default 20ms packet interval, its packets are sent at 30ms intervals (three 10ms frames), with a payload size of 240 bytes (80 bytes/frame x 3) in one direction. Most G.711 and GSM flows have a constant packet length in each direction (apart from the presence of comfort noise and Telephone-Event packets). There are eight exceptions where the flow switched between G.711 and GSM in one direction, and hence contains a mixture of 200-byte and 73-byte packets in that direction.
There were five DNS flows falsely classified as RTP traffic. The reason for this was that
these DNS flows had the same source IP address and source port as one previously registered
RTP session. There were also 17 video (H.263) flows encapsulated in RTP sessions. These
DNS and video flows were removed from the dataset. The remaining 644 RTP flows were then
used as benchmark VoIP flows for my analysis in Chapter 8.
D.2 Statistical properties of G.711 and GSM flows
This section summarises some statistical properties of my VoIP dataset. As shown in Figure
8.2, most G.711 voice packets are 200 bytes long. The sliding window’s mean packet length
sometimes falls below 200 bytes due to the presence of 41-byte comfort noise packets from
time to time. Figure 8.3 reveals that most packets arrive at 20ms intervals. However, there
are outliers that indicate a packet inter-arrival time of greater than 20ms. These longer packet
inter-arrival times are due to jitter, packet loss or silent periods during voice conversations.
Similar traffic characteristics are seen in the reverse direction.
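The windowed statistic behind Figure 8.2 can be sketched as a simple mean over each window of packet lengths (the helper function and toy window below are my own, for illustration):

```python
def mean_packet_length(packet_lengths):
    """Mean IP packet length over one sliding window of packets."""
    return sum(packet_lengths) / len(packet_lengths)

# A window of G.711 voice packets containing one 41-byte comfort-noise
# packet pulls the window mean below 200 bytes:
window = [200] * 9 + [41]
print(mean_packet_length(window))  # -> 184.1
```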
[Figure: mean packet length (roughly 50 to 200 bytes) against M, the number of packets offset from the beginning of each flow (0 to 9K, and FF).]
Figure D.1: G.711 traffic - mean packet length - reverse direction
As shown in Figure 8.4, almost all GSM voice packets are 73 bytes long. There are only
[Figure: mean packet inter-arrival time (0 to 100 msec) against M, the number of packets offset from the beginning of each flow (0 to 9K, and FF).]
Figure D.2: G.711 traffic - mean packet inter-arrival time - reverse direction
a few outliers due to telephone-event packets. Figure 8.5 shows that most packets arrive at
20ms intervals. However, there are outliers that indicate a packet inter-arrival time of greater
than 20ms. These longer packet inter-arrival times are due to jitter, packet loss or silent periods
during voice conversations. Similar traffic characteristics are seen in the reverse direction.
[Figure: mean packet length (50 to 100 bytes) against M, the number of packets offset from the beginning of each flow (0 to 9K, and FF).]
Figure D.3: GSM traffic - mean packet length - reverse direction
[Figure: mean packet inter-arrival time (0 to 60 msec) against M, the number of packets offset from the beginning of each flow (0 to 9K, and FF).]
Figure D.4: GSM traffic - mean packet inter-arrival time - reverse direction
Appendix E
Trade-offs in Cluster Quality and Classifier Performance
In Chapter 6 I demonstrated that there are two options available when choosing an 'optimal' number of clusters: the pre-classification option and the post-classification option. The former was chosen for study in Chapter 6; in this appendix, I investigate the latter.
In Chapter 6, the pre-classification option found eight 'natural' clusters in the 18 sub-flows input from Step 1. From these I obtained eight representative sub-flows, which were used to train and test my Naive Bayes classifier. Using the post-classification option, I instead pre-specify EM to start from two clusters, obtain the representative sub-flows from the clustering results, and use them to train and test my Naive Bayes classifier. I then continue to add clusters until the accuracy of the Naive Bayes classifier can no longer be increased.
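The post-classification search can be sketched as a simple loop. Here `evaluate` is a hypothetical callback standing in for the whole cluster/train/test cycle, and all names and the toy tolerance are my own:

```python
def choose_cluster_count(evaluate, start=2, max_clusters=20, tol=0.005):
    """Grow the number of clusters until classifier accuracy stops improving.

    `evaluate(k)` is assumed to cluster the sub-flows into k groups, train
    the Naive Bayes classifier on the representative sub-flows, and return
    its accuracy. Gains below `tol` are treated as noise.
    """
    best_k, best_acc = start, evaluate(start)
    for k in range(start + 1, max_clusters + 1):
        acc = evaluate(k)
        if acc <= best_acc + tol:
            break                      # no meaningful improvement: stop
        best_k, best_acc = k, acc
    return best_k, best_acc

# Toy accuracies loosely echoing the Recall medians reported in section E.1:
accs = {2: 0.193, 3: 0.948, 4: 0.978, 5: 0.979}
print(choose_cluster_count(lambda k: accs[k]))  # -> (4, 0.978)
```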
My results reveal that the pre-classification option produces a classifier with higher Recall and Precision that is also a little faster in classification, with a small trade-off in memory usage and a longer time required to build the classifier (more than 16% longer, yet equivalent to only approximately 15 seconds on my test platform). As it is simple, fully automated and independent of the ML classification algorithm, this option can be used as a general approach to assist in automated sub-flow selection. Its only drawback is the long clustering time required, which can be overcome by using a smaller number of sub-flow instances, as demonstrated in Chapter 6. My experimental results are presented in the following sections.
E.1 Accuracy
Figure E.1 presents Recall for classifiers trained using different numbers of clusters. The results are illustrated using boxplots1. Using two clusters produces a very low Recall (a median of 19.3%), while using three or more clusters produces a median Recall of at least 94.8%. Median Recall reaches 97.8% at four clusters and remains almost the same at five clusters. The eight clusters selected by the pre-classification process result in the maximum median Recall of 99%. They also produce the most stable Recall result, with the smallest gap between the 25th and 75th percentiles of the boxplot.
As outlined in Figure E.1, Recall generally increases with the number of clusters. This is because using more clusters provides a better chance of covering more distinct phases within a flow's lifetime, and hence better and more stable Recall in classification. Once there are sufficient clusters to cover all the phases of the full-flow, adding more clusters yields no further improvement. This explains the jump in Recall when selecting three clusters versus two, as presented in Figure E.1, and why the improvement is small for classifiers built using more than three clusters.
Figure E.2 depicts the Precision of the classifiers. Using two clusters produces the classifier with the lowest median Precision, 81.9%. Classifiers built on four and five clusters result in an almost identical median Precision of 93%. The eight clusters selected by the pre-classification option produce the classifier with the highest median Precision, 93.3%. The Precision results follow a similar curve to the Recall results: Precision increases noticeably when the number of clusters increases from two to three, and only very slightly beyond three.
The number of clusters can affect Precision, as the inclusion of one or more clusters can reduce or increase the unwanted range of feature values (the gap between the disjoint clusters) used to train a classifier, as illustrated in section 5.4.4. This can in turn reduce or increase the possibility of false positives. Precision can also be affected by the ratio of traffic mix between
1 The black line in the box indicates the median; the bottom and top of the box indicate the 25th and 75th percentiles, respectively. The vertical lines drawn from the box are whiskers. The upper cap is the largest observation that is ≤ the 75th percentile + 1.5*IQR (interquartile range, essentially the length of the box). The lower cap is the smallest observation that is ≥ the 25th percentile - 1.5*IQR. Any observations beyond the caps are drawn as individual points, which indicate outliers.
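For concreteness, the whisker caps defined in this footnote can be computed as follows (a sketch using NumPy's default linear-interpolation percentiles; the function name and sample data are my own):

```python
import numpy as np

def whisker_caps(values):
    """Return (lower cap, upper cap): the extreme observations lying
    within 1.5*IQR of the box edges."""
    data = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(data, [25, 75])
    iqr = q3 - q1
    upper = data[data <= q3 + 1.5 * iqr].max()
    lower = data[data >= q1 - 1.5 * iqr].min()
    return lower, upper

low, high = whisker_caps([1, 2, 3, 4, 5, 100])
print(low, high)  # -> 1.0 5.0 (100 falls beyond the caps: an outlier)
```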
[Figure: boxplots of Recall (0 to 100%) for classifiers trained using 2, 3, 4, 5 and 8 clusters.]

Figure E.1: Recall for different classifiers trained using different numbers of clusters
the number of examples for Other traffic versus the number of examples for ET traffic used for
testing.
These are two possible factors that lead to the Precision results presented in Figure E.2.
Detailed analysis of the impact of each factor on Precision is left for future work. Nonetheless, there is little difference between the Precision for eight clusters and that produced for five clusters (the former is only 0.3% higher)2.
In summary, my results demonstrate that using the pre-classification option produces the best classifier in terms of both Precision and Recall. Using the post-classification option produces a classifier with slightly lower Precision and Recall. However, the former needs to use eight clusters, while for the latter Recall and Precision appear to reach an optimal point at four clusters (after which the increase in Recall is insignificant)3. Using a smaller number of clusters (and hence a smaller number of sub-flow instances to train the classifier) could reduce the time and memory required to build a classifier. This is evaluated in the
2 These results are slightly different from the results I previously reported in a similar study [208], because [208] used different examples of Other traffic to train and test the classifier. Furthermore, the ratio of ET traffic versus Other traffic in the traffic mix used for testing in [208] was different: approximately 1:2, in contrast to the range of approximately 1:10 to 1:5 used in my experiment (see section 5.3.4).
3 There can be trade-offs between Precision and Recall, both of which may not converge to an optimal point at the same number of clusters. A priority rule can be useful in the decision-making process. Analysis of these trade-offs is a subject for future research.
[Figure: boxplots of Precision (20 to 100%) for classifiers trained using 2, 3, 4, 5 and 8 clusters.]

Figure E.2: Precision for different classifiers trained using different numbers of clusters
next section.
E.2 Computational performance
Figure E.3 compares the normalised build time required for each classifier. A value of 1 represents the slowest build time (96 seconds on my test platform). The smaller the number of clusters used, the faster a classification model can be constructed, because the number of instances used to train the classifier grows with the number of clusters. Building a classifier using eight clusters takes approximately 16% longer than using only four clusters; however, this difference is only equivalent to approximately 15 seconds of CPU time.
Figure E.4 presents the normalised classification speed for the models with the same test dataset. A value of 1 represents the fastest classification speed (3,984 classifications per second on my test platform). In contrast to the build time, the classification speed varies only slightly with the number of clusters chosen. This is because the classification speed depends directly on the classification rules, rather than on the number of instances used to train the classifier. The classifier using the pre-classification option achieves the highest classification speed; however, the difference is small, only 1.5% faster than in the case of the post-classification
[Figure: normalised build time (0.0 to 1.0) for classifiers trained using 2, 3, 4, 5 and 8 clusters.]

Figure E.3: Normalised build time for different classifiers trained using different numbers of clusters
option.
Figure E.5 presents the normalised memory usage for the classification models while performing 10-times cross-validation on their training datasets. A value of 1 represents the highest memory consumption (304MB on my test platform). Although all models consume quite modest memory resources, classifiers built using fewer clusters consume less memory (due to the smaller training dataset). However, the differences are less than 3%.
Figure E.6 presents the normalised clustering time for different numbers of clusters. A value of 1 represents the longest time (approximately 5.4 hours on my test platform). The greater the number of clusters specified, the longer the clustering algorithm takes in terms of CPU time. In the pre-classification option, generating eight clusters requires up to 172 hours to complete, because of the many repeated trial runs required by the WEKA cross-validation implementation. This clustering time can be vastly reduced by using a smaller number of sub-flow instances in the clustering process, as demonstrated in Chapter 6.
[Figure: normalised classification speed (0.0 to 1.0) for classifiers trained using 2, 3, 4, 5 and 8 clusters.]

Figure E.4: Normalised classification speed for different classifiers trained using different numbers of clusters
[Figure: normalised memory usage (0.0 to 1.0) for classifiers trained using 2, 3, 4, 5 and 8 clusters.]

Figure E.5: Normalised memory usage for different classifiers trained using different numbers of clusters
[Figure: normalised clustering time (0.0 to 1.0) for 2, 3, 4 and 5 clusters.]

Figure E.6: Normalised clustering time for different classifiers trained using different numbers of clusters