TTrees: Automated Classification of Causes of Network Anomalies with Little Data

Mohamed Moulay†, Rafael Garcia Leiva∗, Vincenzo Mancuso∗, Pablo J. Rojo Maroni‡, Antonio Fernandez Anta∗
†University Carlos III of Madrid, Spain   ∗IMDEA Networks Institute, Madrid, Spain   ‡Nokia Networks, Madrid, Spain

{mohamed.moulay, rafael.garcia, vincenzo.mancuso, antonio.fernandez}@imdea.org, [email protected]

Abstract—Leveraging machine learning (ML) for the detection of network problems dates back to handling call-dropping issues in telephony. However, troubleshooting cellular networks is still a manual task, assigned to experts who monitor the network around the clock. We present here TTrees (from Troubleshooting Trees), a practical and interpretable ML software tool that implements a methodology we have designed to automate the identification of the causes of performance anomalies in a cellular network. This methodology is unsupervised and combines multiple ML algorithms (e.g., decision trees and clustering). TTrees requires small volumes of data and is quick at training.

Our experiments using real data from operational commercial mobile networks show that TTrees can automatically identify and accurately classify network anomalies—e.g., cases for which a network's low performance is not apparently justified by operational conditions—training with just a few hundred data samples, hence enabling precise troubleshooting actions.

I. INTRODUCTION

Cellular networks have been through an exceptional evolution in recent years. With 4G, and even more with 5G, network services have gained a significant degree of intelligence, involving intensive access to both data communication and computing resources. With the evolution of cellular networks, there has also come an increase in structural complexity and heterogeneity of services, requiring constant monitoring of the communication system. This landscape will only become more complex with the evolution of 5G and the emergence of 6G [1]. Most importantly, new systems will require intelligent and automatic troubleshooting tools, which is the focus of our work. We resort to simple and unsupervised ML tools to build an interpretable and robust machine learning (ML) methodology, which we have implemented and tested with real data. Besides interpretability, a key advantage of our proposal is that it only requires access to small volumes of training data.

To provide customers with Quality of Service (QoS), many key performance indicators (KPIs) are continuously harvested to monitor network health. These indicators may measure the performance of radio, TCP, routing, etc. These measurements can be used to trigger troubleshooting actions. Specifically, to benchmark QoS in mobile networks, continuous drive tests with end-to-end test scenarios are performed every day, internally by the network operators and externally by third parties or government regulators. Each of these test campaigns can have up to tens of thousands of individual test cases, from which specific KPIs are calculated. Drive tests are normally executed with off-the-shelf testing equipment (NEMO, TEMS, Swissqual, etc.), capable of running predefined sequences of tests and collecting relevant low-level radio and traffic information, as well as application performance statistics.

A line of work on monitoring and troubleshooting of cellular networks is the development of self-healing networks. Self-healing networks are responsible for detecting, identifying, and making decisions on recovery actions [2]. Multiple proposals exist for making fault detection and self-healing systems practical in mobile networks [3]. However, current self-healing troubleshooting proposals lack flexibility and do not scale well. There are newly-defined approaches based on ML, which use deep learning and neural networks (as black boxes) [4]. Unfortunately, these approaches lack interpretability, so it is not possible to understand the cause of a detected network problem, and manual intervention is required for classification after the detection. It is then possible to resort to Explainable AI, or XAI, which deals with the problem of how human users could understand AI's cognition, and decide if an AI-based model can be trusted or not [5]. However, XAI does not help interpret AI decisions and conclusions per se. Multiple methods have been proposed to address the complex issue of ML interpretability, from determining which features contribute the most to a neural network's output, to the development of targeted models that explain individual predictions [6].

In summary, troubleshooting remains a substantially manual procedure, in which highly skilled experts analyze alarms and statistics of performance indicators regularly to detect and diagnose the cause of problems in the network. In contrast, unskilled engineers cannot even detect problems effectively [7].

A. Original Contribution of the Work

We propose an unsupervised methodology for the detection and classification of network performance anomalies, and its software implementation, Troubleshooting Trees (TTrees). We join the ML research stream, while focusing on the automated detection and classification of possible network performance anomalies even with small volumes of measurements. Differently from existing ML proposals, we detect and classify anomalies by simply identifying the characteristics of what ML algorithms cannot learn/explain. For instance, a network that shows low throughput is not anomalous if this low KPI can be explained, e.g., by detecting the presence of a bad radio link. For this reason, a novel aspect of our work is the use of simple interpretable ML algorithms (like decision trees), moving away from the current trend of high-accuracy ML algorithms (e.g., deep learning) that do not allow interpretation (and hence understanding) of their outcome. In fact, we prefer having lower accuracy, since we consider as interesting (anomalous) the scenarios that are misclassified by the ML algorithms, and we do not want to miss them by overfitting. Understanding and classifying these anomalous scenarios allows alerting the appropriate department to take corrective actions. Specifically, we use decision trees since they are transparent to inspection, hence interpretable [8].

We provide a Python open-source implementation of our methodology based on the Scikit-learn library [9] and a new library that we have implemented.1

In order to evaluate the methodology, we use real operational network data. Two of the datasets used here were collected for cellular service auditing purposes by Nokia in various European countries. They include thousands of features (i.e., metrics), but are also quite limited in terms of number of data samples. Thus, they are not suitable for commonly adopted deep neural networks and ML methods requiring complex training. The other dataset has been obtained in measurement campaigns with MONROE, an open-access platform for multi-homed experimentation with commercial mobile broadband networks across Europe [10]. With MONROE, we have access to a large dataset, although with a reduced number of metrics.

The experts in detection and classification of network issues from the research team have been instrumental in the validation of the methodology and the TTrees system. They have manually inspected the outcome of the unsupervised process, verifying that the scenarios that were classified as anomalous did in fact show strange combinations of performance indicators, and that their unsupervised classification into network aspects was consistent. Additionally, they guided the development of a supervised implementation of the methodology, STrees, in which the KPIs were manually aggregated based on network aspects. The comparison of the outcomes of unsupervised and supervised methods with the same data has also been used to validate the methodology.

B. Organization of the Paper

The rest of the paper is organized as follows. Section II discusses the related work. Section III overviews the unsupervised methodology proposed. Section IV explains in detail how this methodology has been implemented as the TTrees system. Section V presents and analyzes the results TTrees achieves when it is applied to real datasets. We summarize and conclude the paper in Section VI.

II. RELATED WORK

Early works on fault detection suggested the use of time series regression methods and Bayesian networks. For instance, Barco et al. [11] produced an automated tool for troubleshooting mobile communication networks back in the days of 2G, relying on Bayesian networks to detect call drops. Khanafer et al. [12] followed up by proposing a method based on Bayesian networks to detect faults in UMTS systems, in which they apply different algorithms to discrete KPIs. Other works, such as [13], rely on a scoring-based system, in which the authors build the fault detection subsystem around labeled fault cases.

1 https://github.com/Mohmoulay/WoWMoM2021

These cases were previously identified by experts, using a scoring system to determine how well a specific case matches each diagnosis target. The work presented in [14] is based on a supervised genetic fuzzy algorithm that learns a fuzzy rule base and, as such, relies on the existence of labeled training sets. Indeed, most of the techniques proposed in the literature focus on using supervised machine learning algorithms [12]–[14]. In this paper we show that it is convenient to use unsupervised techniques to unveil hidden information in the input data, without restricting a priori the possible outcomes.

Other works make use of advanced mathematical and statistical tools. For instance, Ciocarlie et al. [15] address the problem of checking the effect of network changes via monitoring the state of the network, and determining if the changes resulted in degradation. Their fault detection mechanism uses Markov logic networks to create probabilistic rules that distinguish between different causes of problems. A framework for network monitoring and fault detection is introduced in [16], using principal component analysis (PCA) for dimension reduction, and kernel-based semi-supervised fuzzy clustering with an adaptive kernel parameter. To evaluate the algorithms, they use data generated by means of an LTE system-level simulator. The authors claim that this framework proactively detects network anomalies associated with various fault classes. These methods lack the flexibility of ML-based ones and, differently from our proposal, cannot be fully automated for a generic network context.

It is worth mentioning that most of the existing proposals have been evaluated only through simulators, and require large datasets. They help to detect network issues, but do not contribute to interpreting the network behavior. By comparison, in our work, we use not-necessarily-abundant data collected in real operational networks and propose a fully automated ML-based methodology that leads to a straightforward interpretation of network behaviors. This includes identifying not only the occurrence of problems but also their root causes.

III. OVERVIEW OF THE PROPOSED METHODOLOGY

In this section, we briefly describe the framework, and the steps into which the automated unsupervised methodological process we propose is divided.

A. The Core Idea

Our methodology has been developed to find and interpret anomalies in a network, in an automatable way. The core idea behind our approach has been inspired by the concepts of Kolmogorov complexity and, in particular, by the minimum description length and minimum message length principles [17], [18], according to which learning from data is equivalent to compressing data (i.e., describing data in the shortest possible way). Since classifying data points is a form of compression, we conjecture that what a classifier cannot capture (hence the misclassification cases) are nothing but non-learnable behaviors, i.e., anomalies. For this reason, in our methodology, we apply ML classifiers to find anomalies by finding scenarios that cannot be learnt. Furthermore, we design the classifiers so as to be accurate and interpretable.


B. Input

The input to the process is a collection x of data samples, obtained from n network operation scenarios. Each data sample xi includes the k performance indicators (xi1, . . . , xik) measured in scenario i ∈ [1, n]. In addition to these indicators, we assume that there is a distinguished set of ℓ target KPIs y which drives the search for anomalies. Hence, the dataset is formed by x and y, and each scenario i ∈ [1, n] is characterized by the indicators (xi1, . . . , xik) and (yi1, . . . , yiℓ).

C. Discretization

The first step of the methodology is to group the data samples into classes based on the value of the target KPIs y. The reason for having this step is that, in general, the target KPI values belong to an infinite domain.2 Note that our methodology is oblivious to the process that collects data samples, as we simply take a dataset and work with all its valid entries. For discretization, it is desirable that the number of classes m is manageable but not too small, for expressiveness. However, we do not need vast amounts of data, as our methodology works with as few as tens of samples per discretization class, so as to have a bare minimum level of statistical relevance. Additionally, it is desirable that data samples are reasonably balanced among classes (although this may not be achievable in all cases).

D. Selection of Anomalous Scenarios

Once the data samples have been assigned to the classes based on their corresponding value of the target KPIs, we use them to train a classifier C1. This classifier will use the indicators (xi1, . . . , xik) to determine whether, in scenario i, the KPIs (yi1, . . . , yiℓ) seem to belong to class j ∈ [1, m]. The objective is to build a classifier that is able to correctly classify as many data samples as possible without overfitting. It is also desirable that the decisions of the classifier can be interpreted.

Applying this classifier C1 to all the data samples will hence incorrectly classify a subset A ⊂ [1, n] of them. The scenarios in set A are the ones that we consider interesting, or anomalous: scenarios whose target KPIs cannot be correctly classified by a properly trained classifier. The interpretability of the classifier helps to determine why the data samples in A are anomalous (with respect to those properly classified). This means identifying which indicators are relevant and how to differentiate A from the rest.
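
As an illustration, this selection step can be sketched in a few lines of Python with scikit-learn, using a decision tree as the interpretable classifier (as our implementation does); the function name and arguments below are illustrative, not part of the TTrees API:

    # Minimal sketch: the anomalous set A collects whatever a properly
    # trained, interpretable classifier C1 cannot explain.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def find_anomalies(X, y_cls, max_depth):
        """Train C1 on indicators X and discretized classes y_cls;
        return C1 and the indices of the misclassified samples (set A)."""
        c1 = DecisionTreeClassifier(max_depth=max_depth, min_samples_leaf=5)
        c1.fit(X, y_cls)
        A = np.flatnonzero(c1.predict(X) != y_cls)
        return c1, A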

Observe that the scenarios in A are not necessarily those in which any of the target KPIs show low performance. For instance, if y consists of a single KPI, e.g., the average throughput observed, a scenario i with low throughput yi may not be anomalous if this is due to a low radio coverage (i.e., cell edge). This scenario should be correctly classified by classifier C1 as bad, and a visual inspection of the radio indicators in xi should reveal the issue. The anomalous scenarios that are of interest to us are those that cannot be (directly) explained, for instance, a scenario i with low throughput yi in which the performance indicators xi are all good.

2 We explored options without discretization, but they were unsatisfactory, because they led to complex approaches with additional hyperparameters.

Hence, from this point on, our target will be to identify which types of anomalies the scenarios in A show. This could be used to report the issue (complemented with the data samples) to the appropriate department that may take corrective action.

E. Selection of the Most Relevant Performance Indicators

Let xA and yA be the sets of anomalous scenarios xi and yi, for i ∈ A. These are the scenarios that we want to explore further. A natural approach to attempt for understanding what makes these scenarios anomalous would be using another classifier only for them. While this indeed separates these scenarios into different classes, our experience has shown that the results are hard to interpret, even by an expert.

For this reason, for the sake of interpretability, in our methodology we proceed by restricting the set of performance indicators for further analysis. Hence, in the next step of the methodology, we select a small subset R ⊂ [1, k] of KPIs. The set R should contain only non-redundant indicators, and should contain those that are highly informative with respect to the set yA. The process we apply to select R is as follows. First, rank the performance indicators of xA by the amount of information they provide about yA. Then, using this ranking, select the top indicators in order, removing those that are redundant with the previously selected. This process stops when a certain number of indicators have been selected, or when the next indicator in the rank gives little information about yA.
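
A hedged sketch of this greedy ranking-and-filtering loop, using scikit-learn's mutual_info_classif as a stand-in relevance score and a caller-supplied redundancy test (all names are illustrative):

    # Greedy selection of R: scan indicators by decreasing relevance,
    # keep a pick only if it is not redundant with those already in R.
    import numpy as np
    from sklearn.feature_selection import mutual_info_classif

    def select_relevant(X_anom, y_anom_cls, k_max, redundant, min_score=1e-3):
        scores = mutual_info_classif(X_anom, y_anom_cls)
        R = []
        for idx in np.argsort(scores)[::-1]:
            if len(R) == k_max or scores[idx] < min_score:
                break                  # enough picks, or little information left
            if all(not redundant(X_anom[:, idx], X_anom[:, j]) for j in R):
                R.append(idx)
        return R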

F. Clustering Using the Most Relevant Indicators

The next step of the methodology is to use the indicators in R to classify the anomalous scenarios into meaningful types. We have observed empirically that the indicators in R belong to different network aspects (e.g., radio performance indicators, TCP performance indicators), and that an anomalous scenario may show issues in several of these aspects simultaneously. This leads to a process in which subsets r ⊆ R are selected, and each subset is used to classify the anomalous scenarios using binary clustering, indicating whether the selected subset of indicators is (at least partially) responsible for the anomaly. The conjecture is that this (unsupervised) process will select subsets that correspond to different network aspects, and the clustering will separate “good” scenarios from those with issues in that aspect. While this process is unsupervised, our experience is that it leads to subsets of R that an expert can easily match with particular aspects of the network.

Hence, in this step we select α subsets ra ⊆ R, a ∈ [1, α], and apply each of them to divide the set A of anomalous scenarios into two clusters, bad and good. It is desirable to have balanced, compact and non-redundant clusterings. Each anomalous scenario will be assigned to one of the two clusters generated with each subset ra. This can be represented as a binary vector of length α, with a value bad or good in each position. This classifies all anomalous scenarios A into 2^α types. Each of the bits in the vector associated to a scenario conveys whether that scenario shows issues in the particular network aspect. However, observe that the methodology cannot assign a meaning to a given value bad or good in the vector, since it is unsupervised. To do that, an expert should inspect the subsets and the clusters, and assign that meaning.
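
The composition of the per-aspect cluster labels into types is mechanical; a minimal sketch (illustrative names):

    # Combine the alpha binary clusterings into one of 2**alpha types:
    # each anomalous scenario gets a string such as '0110', one bit per aspect.
    import numpy as np

    def combine_aspect_bits(bit_vectors):
        """bit_vectors: list of alpha arrays of 0/1 cluster labels over A."""
        bits = np.column_stack(bit_vectors)
        return np.array(["".join(map(str, row)) for row in bits])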


Fig. 1: Steps of TTrees (using multiple ML techniques): beginning with data preparation, proportional discretization, training of knowledge tree, selection of most relevant indicators, identification of anomaly clusters, and training of a network aspect anomaly classifier.

G. Aspect Classification of Scenarios

Finally, a second classifier C2 is built using xA and yA as training data, and the 2^α types as the classes to which these scenarios are assigned. Observe that we do not restrict the classifier C2 to use only the indicators from R (although it is very likely that they will be used). It is desirable that this classifier is interpretable, so that an expert is able to understand why a scenario is considered to have issues in one particular network aspect.

If an expert has been involved in the process, he/she should have assigned a specific network aspect to each subset ra, and a meaning to every value (bad or good) of every bit in the vector (the class output of the classifier). This information can be used to determine the issues each particular anomalous scenario has, and send it to the appropriate department.

H. Using the Classifiers to Detect and Identify Anomalies

After the previously described steps have been completed, we have two classifiers, C1 and C2, that can be used to detect anomalous scenarios, determine the network aspects that make a scenario anomalous, and hence notify the appropriate department for troubleshooting.

In particular, consider a new scenario (x1, . . . , xk, y1, . . . , yℓ). Using classifier C1, we determine that this is an anomalous scenario if the class C1 assigns to (x1, . . . , xk) is not the one corresponding to the discretization classes of (y1, . . . , yℓ). If that is the case, C2 will be used, so that the class assigned to (x1, . . . , xk, y1, . . . , yℓ) is a binary vector indicating whether this scenario has issues in the different network aspects. As mentioned before, if an expert identified the network aspects that correspond to each bit of the binary vector and which value of the bit identifies a type of issue, the vector can be used to report the anomalous scenario to the corresponding departments.
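
The resulting two-stage inference can be sketched as follows, where c1, c2 and the discretizer are the artifacts produced by the previous steps (function and variable names are illustrative):

    # Detection (C1) followed by identification (C2) on a new scenario.
    import numpy as np

    def diagnose(x_new, y_new, discretize, c1, c2):
        observed = discretize(y_new)          # discretization class of the KPIs
        predicted = c1.predict([x_new])[0]    # class C1 infers from the indicators
        if predicted == observed:
            return None                       # behavior explained: not anomalous
        # Anomalous: C2 assigns a binary aspect vector such as '0110'.
        record = np.concatenate([x_new, y_new]).reshape(1, -1)
        return c2.predict(record)[0]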

IV. TTREES

Here we present the implementation details of the tool TTrees that automates the methodology presented in the previous section. Fig. 1 depicts the different phases of the proposed methodology according to their implementation in TTrees. As reported in the figure, our implementation embeds a number of standard tools for data analysis and ML, from discretization algorithms to clustering and classification.

A. Data Preparation in TTrees

The first phase consists of preparing the available data. This is a necessary preliminary step that includes data filtering, cleaning the dataset of incomplete entries, and extracting the KPIs—the k indicators in x and the target KPI y—that need to be observed (i.e., note that the given dataset might include more indicators than what we are interested in). Thus, we use a Python script to extract x and y from the dataset. The script automatically discards entries in which one or more indicators are not reported. It also discards indicators that are constant throughout the dataset. Indicators can be either numeric values or categorical entries (e.g., labels used to indicate the operational conditions of the network, like the name of the radio technology used, the name of the protocol adopted for transmission, etc.).

B. Discretization in TTrees

The selected target variable y is usually a numerical KPI, and its values may lie in a continuous real interval. Thus, a first problem to address in the training of TTrees consists of discretizing the continuous target variable into a finite number of classes (i.e., categories). Using a simple quantization of the values in y would introduce an error in the subsequent data analysis. Hence, instead we keep the target values as they are in the dataset, and assign a label to each point, which identifies its class. Since TTrees is unsupervised, discretization uses progressive numbers as labels (e.g., “0” to “5”).

The number of classes (labels) used is not pre-defined. Instead, it is automatically derived by TTrees from the available number of points in the dataset. There exists a collection of candidate discretization algorithms that are not biased and have low variance (e.g., equal width, equal frequency, or fixed-frequency). Unfortunately, they require optimizing hyperparameters [19], which makes them inadequate for an unsupervised automated software tool. Instead, we use a discretization approach that does not require any hyperparameter tuning: proportional discretization [20]. This method uses a number of categories proportional to the size of the dataset. The number of categories used, denoted by m, is computed as m = ⌊(log2 n)/2⌋, where n is the number of samples of KPIs y. We use the proportional discretization approach to identify the centroids of the m intervals of KPI values (associated to the m categories). After identifying the centroids, we apply k-means clustering to the continuous target variable y, which assigns labels to the data points and returns the boundaries between category intervals.
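
A minimal sketch of this discretization step, assuming a numeric target KPI array y (scikit-learn's KMeans; names are illustrative):

    # Proportional discretization: m = floor(log2(n)/2) classes, with
    # k-means assigning each target value to a class.
    import numpy as np
    from sklearn.cluster import KMeans

    def discretize_target(y):
        n = len(y)
        m = max(2, int(np.log2(n) / 2))
        km = KMeans(n_clusters=m, n_init=10).fit(np.asarray(y).reshape(-1, 1))
        order = np.argsort(km.cluster_centers_.ravel())
        relabel = np.empty(m, dtype=int)      # rename clusters so that label 0
        relabel[order] = np.arange(m)         # is the lowest-valued interval
        return relabel[km.labels_], m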

C. Knowledge Model for Identifying Anomalies

Once the data points y are assigned to a finite number of categories, TTrees continues by building a knowledge model, meant to identify anomalies. For this step we use an ML classifier, which we want to be interpretable, so as to allow the user to understand what kind of anomaly is detected. This is achieved by training a decision tree [21] with the data set x, and using as classes the categories defined in the discretization phase. The decision tree infers the class for a sample yi of y from the values of the associated indicators xi. This justifies the use of the terms “knowledge model” and “knowledge tree” for the role of this classifier. However, some samples yi can be misclassified, representing anomalous behavioral trends that cannot be learned. Note that, with more advanced models (neural networks, random forests, etc.) instead of a decision tree, we could gain accuracy, but we would lose interpretability and may discard interesting samples.

The knowledge tree is built by applying the widely adopted Classification And Regression Tree (CART) algorithm [22], with the Gini impurity metric [23] computed by comparing the output of the classification with respect to the discretization categories. We refer to the resulting decision tree as C1. The depth of C1 is limited to a maximum of ⌊(log2 n)/2⌋, since a deeper tree would suffer a high risk of overfitting [24]. We also limit the number of samples at the tree leaves to a minimum of five, to further reduce the risk of overfitting [23].

After the full tree has been derived to the maximum depth allowed, we perform a cost-complexity pruning, which is the best known approach to avoid overfitting in potentially large trees [23], combined with cross-validation of the candidate pruning points. The leaves of the tree that have the highest Gini impurity measure are removed first. Pruning minimizes a cost-complexity metric that linearly combines the classification error (cost) made by the tree and the size of the tree (complexity). However, the pruning metric has a hyperparameter which represents the relative importance of cost with respect to complexity. Cross-validation is used to tune that hyperparameter automatically. In practice, for each candidate value of the hyperparameter, we identify the tree with the minimal cost-complexity, and cross-validate the performance of that tree by splitting the dataset into smaller pieces of at least 30 samples each: we use all but one of the pieces to train the tree and the remaining one for evaluation. This process is repeated as many times as the number of pieces into which we break the original dataset (so as to use every piece for validation exactly once).
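
Scikit-learn exposes the pieces needed for this step directly; a hedged sketch (illustrative names, not the exact TTrees code):

    # Depth-limited CART with cost-complexity pruning; the pruning strength
    # (ccp_alpha) is chosen by cross-validation over folds of >= 30 samples.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def train_pruned_tree(X, y_cls):
        depth = max(1, int(np.log2(len(y_cls)) / 2))
        tree = DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5)
        alphas = tree.cost_complexity_pruning_path(X, y_cls).ccp_alphas
        folds = max(2, min(10, len(y_cls) // 30))
        best = max(alphas, key=lambda a: cross_val_score(
            DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5,
                                   ccp_alpha=a), X, y_cls, cv=folds).mean())
        return DecisionTreeClassifier(max_depth=depth, min_samples_leaf=5,
                                      ccp_alpha=best).fit(X, y_cls)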

Once C1 is created, for each sample of y there are two possibilities: either the sample is assigned the same class in the discretization phase and by the knowledge tree, or it is not. In the latter case, an anomaly is identified.

D. Most Relevant TTrees Indicators

The next step consists of identifying a subset of indicators that are relevant for the anomalous scenarios in A, so as to identify which network aspects are relevant to explain the anomaly. TTrees then evaluates each of the available indicators by computing how much information on the target y is contained in the indicator. For this task we have used the well-known mutual information (MI) [25] and joint mutual information (JMI) [26] metrics from classic information theory.

We have also tested a novel metric that we have built on top of the concepts of Kolmogorov complexity and normalized compression distance (NCD) [27], which we call miscoding (Mscd). The miscoding of a sequence of j ≥ 1 different indicators xJ = ⟨xi1, xi2, . . . , xij⟩ with respect to the target y is defined as

    \mathrm{Mscd}(x_J, y) = \frac{1 - \mathrm{NCD}(x_J, y)}{\sum_{i_1, i_2, \ldots, i_j} \left(1 - \mathrm{NCD}(x_J, y)\right)}.    (1)

We propose Mscd because it measures the difficulty of reconstructing the target y from the indicators xJ and vice versa, which accounts for the redundancy in xJ. In TTrees we start by computing a conditional redundancy matrix with the normalized compression distance of all possible pairs of indicators with respect to the target variable, NCD(⟨xi, xj⟩, y), based on a joint discretization of the attributes and the computation of optimal-length codes, given the relative frequencies of the discretized vectors. Then, we select the attribute with the highest NCD, and recompute the values of the redundancy matrix assuming that this value is selected. We repeat this process of selecting the best feature and recalculating the redundancy matrix until all the indicators have been selected. The final miscoding is given by normalizing over one minus the NCDs computed for the different attributes.
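
For intuition, the NCD underlying miscoding can be approximated with any off-the-shelf compressor; a toy sketch with zlib as a stand-in (the TTrees library instead derives optimal-length codes from the discretized vectors, as described above):

    # NCD(a, b) compares how well the pair compresses jointly versus alone:
    # values near 0 mean a and b are essentially the same object, near 1 unrelated.
    import zlib

    def ncd(a: bytes, b: bytes) -> float:
        ca, cb = len(zlib.compress(a)), len(zlib.compress(b))
        cab = len(zlib.compress(a + b))
        return (cab - min(ca, cb)) / max(ca, cb)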

MI is applied on a per-indicator basis and is able to quantify the relevance of an indicator, but it fails to identify redundancy. JMI is instead used on indicator pairs, so that it allows evaluating not only the relevance of an indicator but also its redundancy with another indicator. However, JMI is prone to errors when outliers are present. Mscd offers the possibility of evaluating relevance and redundancy, plus the quantity of irrelevant information contained in the indicator.

We will compare the performance obtained by applying MI, JMI and Mscd in the numerical evaluation section. Here it is enough to mention that in all three cases, we sort the indicators in decreasing order according to the selected metric, and pass the top of the list to the next TTrees phase. In particular, since we will need to cluster the anomalies based on selected network aspects, we pass to the clustering phase a number of indicators much larger than the number of aspects to study (in the current implementation, α^2, where α is the number of aspects, which is fixed a priori).

E. Clustering of Anomalies in TTrees

Each pair of relevant indicators is regarded as a potential network aspect. For each aspect, TTrees applies a clustering algorithm on all anomalous samples. This bidimensional clustering is done with the k-means algorithm [23], using the normalized values of the indicators obtained via RobustScaler [28]. In addition, and for visualization purposes only, when we have just one target KPI, we use a simple regression with respect to y on each of the indicators, and we sort samples according to increasing regression values. This allows plotting anomalous samples consistently, so that higher y values are at the top-right corner of the plot. As a consequence, the cluster of samples at the top-right corner contains the samples with the highest y values, and the cluster at the bottom-left corner the samples with the lowest y values. If the target KPI y is better when larger (resp., smaller), then the bottom-left (resp., top-right) cluster is the one with issues in the network aspect defined by the two indicators. In general, the cluster with issues is assigned a bit 1 in this aspect, and the other cluster is assigned a bit 0.
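
A minimal sketch of the per-aspect clustering (scikit-learn; names are illustrative):

    # Scale one pair of relevant indicators robustly, then split the
    # anomalous samples into two clusters with k-means (k = 2).
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import RobustScaler

    def cluster_aspect(X_pair):
        """X_pair: (|A|, 2) array with the two indicators of one aspect.
        Returns the 0/1 labels and the inertia of the clustering."""
        Z = RobustScaler().fit_transform(X_pair)
        km = KMeans(n_clusters=2, n_init=10).fit(Z)
        return km.labels_, km.inertia_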


TABLE I: Brief description of the experimental datasets used in this work

ID  Dataset              # Samples  # Indicators  Families of performance indicators
#1  HTTP FILE DL         6,690      228           Radio, RTT, Duration, TCP Volume, TCP Flags, TCP Window, Packet Anomaly
#2  HTTP LIVEPAGE DL     10,500     126           Radio, RTT, Duration, TCP Volume, TCP Flags, TCP Window, Packet Anomaly
#3  HTTP FILE DL MONROE  120,000    106           Radio, RTT, TCP Window, Packet Anomaly

The problem is that we potentially have a large number of pairs of indicators (network aspects), and some of them might carry redundant information. Therefore, TTrees selects only a few pairs of indicators, so as to identify non-redundant descriptions of anomalies. In our current implementation we select α = 4 indicator pairs, i.e., the 4 most relevant aspects to evaluate to understand the anomaly. Thus, the number of indicators passed from the previous step is p = α^2 = 16. We do not use all possible indicators as a matter of practicality. For example, with p = 120 indicators, which is a reasonable value for cellular data traffic traces, we would have to evaluate p(p − 1) = 14280 pairs, and make hundreds of millions of comparisons to test the redundancy among pairs.

The most relevant and descriptive indicator pairs are selected based on multiple metrics. First of all, TTrees computes the inertia score [29] of the clusters resulting from each indicator pair. The inertia of a cluster quantifies the tightness of the clusters (the lower, the better). The resulting ranked list is filtered by using two criteria: TTrees only keeps clusterings with low redundancy and with a balanced number of elements in the two clusters. Redundancy of clusters is computed using the normalized information distance (NID) metric [30]. Here the order matters, since if two indicator pairs yield redundant clusters, the indicator pair with lower inertia is retained, while the other is discarded. In order to automatically tune and apply both filters, we select filtering thresholds with the help of the histograms obtained for the distributions of the redundancy scores and of the balance metric. We have observed that the histogram of redundancy shows a bimodal distribution; therefore, we select as threshold to accept/reject an indicator pair the point with the lowest value in-between the two peaks of the bimodal distribution. For balancedness, the distribution histogram is multimodal, and we automatically select two thresholds: the minimum point between the first and second peaks (minimum accepted value of balancedness), and the minimum point between the last two peaks (maximum accepted value of balancedness).
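
The redundancy test between two candidate clusterings can be sketched with scikit-learn, using a common formulation of NID as one minus the mutual information normalized by the larger entropy (an equivalence we assume here):

    # NID between two clusterings: 0 = identical (fully redundant),
    # 1 = statistically independent.
    from sklearn.metrics import normalized_mutual_info_score

    def nid(labels_a, labels_b) -> float:
        return 1.0 - normalized_mutual_info_score(
            labels_a, labels_b, average_method="max")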

F. Aspect Classification in TTrees

The α top indicator pairs obtained in the previous phase of TTrees identify network aspects in which the anomaly could be rooted. In this phase they are used to train a classifier (again a decision tree) to map the anomalies A onto the 2^α classes corresponding to the combinations of the α network aspects relevant for the anomalies. This aspect classifier, denoted as C2, uses the binary values of the aspects assigned in the previous phase, and builds new categorical target variables consisting of binary strings that combine these binary aspect values. The bits with value 1 in a category identify aspects in which the scenarios of that category may have issues, and for which a network specialist should be called. Note that this classifier makes decisions based on the full list of indicators, and not only based on the most relevant indicators identified by TTrees. In the training of C2, like for C1, we avoid overfitting by limiting the depth of the tree to ⌊(log2 |A|)/2⌋, and apply cost-complexity pruning with cross-validation to increase the generalization level of the tree.
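
A sketch of this training step, reusing the train_pruned_tree helper sketched in Section IV-C (the depth bound becomes ⌊(log2 |A|)/2⌋ automatically, since the tree is trained on the |A| anomalous samples; names are illustrative):

    # C2: the per-aspect bits become binary-string targets for another
    # depth-limited, pruned decision tree trained on the full record.
    import numpy as np

    def train_aspect_classifier(X_anom, y_anom, type_labels):
        """X_anom, y_anom: indicators and target KPIs of the anomalous set A;
        type_labels: per-scenario strings such as '0110' (one bit per aspect)."""
        records = np.hstack([X_anom, y_anom])
        return train_pruned_tree(records, type_labels)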

G. Software Implementation

TTrees is implemented using Python. We use two libraries: (i) the widely adopted scikit-learn library, which provides us with ML algorithms, such as decision trees and k-means, and (ii) a library developed by us for automatic ML tools, which includes an Mscd calculator.

V. NUMERICAL EVALUATION

Existing troubleshooting tools do not generalize very well and need the assistance of an expert. For this reason, here we evaluate TTrees along with a supervised version of our tool, which provides a validation baseline. For simplicity, here we use only one target KPI (i.e., the throughput) and restrict the analysis of anomalies to the cases in which C1 returns a better class than the discretized observation, i.e., when the knowledge model predicts better results than the discretized observed KPI.

A. Supervised ML-based Troubleshooting

The supervised version of TTrees, denoted as STrees, uses a supervised selection of the indicators that are most relevant for the detected anomalies. In particular, with the supervision of a network expert, STrees selects the most relevant indicators based on KPI families identified by the expert. The identified families range from radio-related indicators (e.g., radio buffer, radio coverage, etc.) to TCP-related indicators (e.g., RTT, TCP window, TCP flags, etc.). The same steps described in Section IV-E for TTrees apply to STrees. However, an expert in the field of cellular networks carefully labels each cluster as anomalous data points with radio, buffering, or TCP issues. Afterward, in STrees, we take the tagged classes and use them to train C2, i.e., the final aspect classification tree, as is done in TTrees.

B. Datasets

We first analyze two data collections of cellular networking experiments used by Nokia in 2019 for network performance assessment of various European 4G networks. These two datasets contain drive-test information on TCP traffic experiments in which constant-size (3 MB) files are downloaded, and live web-pages are fetched with a mobile user device. Then, we further evaluate TTrees using a larger mobile broadband connectivity dataset collected with MONROE [31].

More in detail, the Nokia datasets used for experimental validation are real test records generated by processing all the information provided by the testing equipment used in the drive-tests, and aggregating the information at test level (count, sum, min, max, average, percentiles, etc.). The resulting datasets have a single row per test and hundreds of columns summarizing all the dimensions (date, time, location, network element information, etc.) and indicators (radio, TCP/IP, application, etc.) related to that particular test. The data transmitted in the drive tests is synthetic and does not include customers' sensitive information protected by the European Union General Data Protection Regulation (GDPR) or similar regulations (i.e., the data has been generated by a testing device and not by real users). Finally, potentially sensitive data, such as the identity of the network operator, has been anonymized. The Nokia datasets utilized in this study are rich but not sufficiently large to, e.g., train a neural network. They contain 54,228 and 21,655 rows, respectively, with 1,326 columns.

Fig. 2: Target KPI (throughput) distribution for Dataset #1 via the proportional discretization approach

The third dataset we use is a MONROE dataset with 120,000 rows and 106 columns. We have obtained access to MONROE and replicated the experiments of the first Nokia dataset (i.e., with 3 MB HTTP file downloads) on a larger scale, although with a different number of performance indicators. Moreover, some of the MONROE dataset highlights include the ability to filter by test type, infrastructure, operator, vendor, and transmission technology.

In the next section, we use the first Nokia dataset to validate TTrees and showcase its troubleshooting capabilities for the case of HTTP file download operations over LTE networks. We only consider data for a single operator (we anonymize the operator's name for privacy purposes). The second Nokia dataset is for HTTP webpage downloads performed live, over LTE networks, and for one single operator as well. The MONROE dataset is for HTTP file downloads over LTE networks, with one single operator, and with tests anonymously conducted from labs and public buses. Table I summarizes the features of the datasets used.

C. TTrees in Action and its Validation

For the analysis of the HTTP file download dataset of Nokia, we choose the throughput as the desired target KPI. Fig. 2 illustrates the results obtained when applying the proportional discretization approach to this KPI in TTrees. Since there are 6690 samples, the number of categories used for this specific case is ⌊(log2 6690)/2⌋ = 6. The discretized target variable is fed to the CART algorithm to build the knowledge tree. In CART, we fit the knowledge tree using all the performance indicators available in the dataset.

Fig. 3 shows a subset of the resulting knowledge tree. For a given sample, at the root of the tree the CART algorithm checks whether the Abs_RWIN_FirstSec indicator (the absolute value of the TCP receive window during the first second of a connection) is less than or equal to the self-learned threshold of about 1.3 MB. If this statement is true, the CART algorithm moves to the left side of the tree. Each node of the knowledge tree reports the value of Gini impurity, i.e., a measure of the fraction of misclassified data points that are reported in the tree leaves that can be reached from that node. It also reports the total number of samples examined in the node (6690 at the root) and the number of samples that corresponds to each discretization category. E.g., [2185, 1646, 1293, 950, 506, 110] in the root node correspond to the number of samples that fall in the “Very bad” (0), “Bad” (1), “OK” (2), “Good” (3), “Very Good” (4), and “Excellent” (5) categories. The class value associated to each node is the one with the highest number of samples (“Very bad”, in the example).

Fig. 3: Zoom into a subset of the knowledge tree built for Dataset #1. The figure also reports which family of indicators the branching variable belongs to (see Table I).

Fig. 4: Difference (predicted - discretized) between the class assigned by the knowledge tree and the discretized category for the target KPI of Dataset #1

The full knowledge tree generates a meaningful set of classification rules, each based on a performance indicator, and whose family (radio, TCP window, TCP flags, etc.) is reported in each node of the tree. That can help expert troubleshooters get an initial idea of what is going on in the cellular network, but since the full knowledge tree is wide, manual inspection and interpretation can be cumbersome.3

Fig. 4 shows that the knowledge tree classifies the target samples with high accuracy, although the number of cases in which the predicted class is better than the one observed in the discretization phase is not negligible (‘-1’ in the figure). TTrees selects these 740 anomalies to continue the analysis.

3 The full tree is large and cannot be correctly visualized here, so we made it available online at https://github.com/Mohmoulay/WoWMoM2021.


Fig. 5: Ranking of indicators according to their relevance to the description of anomalies for Dataset #1. (a) Miscoding; (b) MI and JMI.

Fig. 6: NID heatmap matrix for cluster pairs. Darker colors represent redundancy. (a) MI-based selection; (b) JMI-based selection.

The number of network aspects that can be analyzed by composing binary clusters in the aspect classifier C2 with a population of 740 anomalies is α = ⌊(log2 740)/2⌋ = 4. Thus, the number of performance indicators to use is α^2 = 16. Fig. 5 depicts the 16 most relevant performance indicators obtained by using the Mscd metric (left subplot) and by the MI and JMI metrics (right subplot). The figure clearly shows that MI and JMI practically find the same set of performance indicators, while Mscd identifies a significantly different subset, which is due to the fact that Mscd accounts for redundancy of performance indicators while MI and JMI do not.

In Figs. 6 and 7(a), the redundancy of the binary clusters obtained with the 16 most relevant performance indicators is shown, for metrics MI, JMI, and Mscd, respectively. The heatmaps shown in the figures represent the NID metric between cluster pairs: the darker the color, the lower the distance between cluster pairs (and hence the higher their redundancy). Fig. 7(a) shows lighter colors than the other two heatmaps of Fig. 6, which means that Mscd yields indicators whose combinations of clusters have lower redundancy. For this reason we will mainly consider the indicators selected with Mscd in the rest of this section.

Fig. 7(b) illustrates the inertia of the clusters obtained with the Mscd indicators, and the filtering process based on the balance and low-redundancy criteria. The output of this filtering process is a set of 4 network aspects (in practice, 4 pairs of performance indicators) used for binary clustering.

Fig. 7: Network aspect selection based on the use of miscoding; lower inertia values of k-means clusters are preferred, unless clusters are redundant or unbalanced. (a) NID of pairs of clusters built from indicators selected with Mscd; (b) inertia of clusters for all combinations of relevant indicators.

Fig. 8: Aspects obtained with TTrees for Dataset #1 (using Mscd, see Figs. 5 and 7). The parameters used are signal-to-interference-plus-noise ratio (SINR), number of packet acknowledgments sent, absolute TCP window ratio, and RTT. (a) Aspect 1; (b) Aspect 2; (c) Aspect 3; (d) Aspect 4.

Due to lack of space, the visualization of the filtering results for the selection of relevant aspects obtained with MI and JMI is omitted. Results on the performance obtained with those metrics will be commented on later and are reported in Table II.

Fig. 8 shows the 4 network aspects obtained with the first Nokia dataset and the corresponding clusters to be used in the aspect classifier C2. The figure shows that TTrees has identified anomaly clusters characterized by TCP operational parameters, radio parameters, delay parameters and packet anomaly. In STrees, instead, an expert would have to look manually at the “usual suspects” for the presence of anomalies, such as possible radio strength problems, TCP-related problems, packet anomalies, and high RTT caused by buffering, and cluster them to see if they yield any groups. In this case, a Nokia cellular expert has identified for us the most relevant indicators, which led STrees to identify the network aspects visualized in Fig. 9. Aspects identified by TTrees and STrees are quite different, but in both cases they cover the same families (radio, TCP window, RTT, packet anomaly). The actual performance indicators selected by TTrees could be less intuitive for an expert (e.g., selecting the number of acknowledgments instead of packet loss), but without loss of information. In reality, for an expert, Aspect 3 from Fig. 8 makes sense but may not be the first (intuitive) choice; instead, he/she would first look at the packet loss from Fig. 9. Indeed, one of the advantages of the TTrees methodology over STrees is the capability of creating meaningful yet not intuitive inter-family aspects combining radio with TCP, TCP window with packet anomaly, etc.

Fig. 9: Aspects obtained with STrees for Dataset #1. (a) Aspect 1 (TCP window); (b) Aspect 2 (radio coverage); (c) Aspect 3 (RTT); (d) Aspect 4 (packet anomaly).

Fig. 10: A comparison between the supervised and unsupervised approach showing the resulting aspect classifiers

The corresponding C2 trees trained with TTrees and STrees are (partially) reported in Fig. 10. The same four families of indicators are deemed relevant by the supervised and unsupervised approaches: radio, TCP-based, RTT, and packet anomaly. TTrees detects a dependency in Aspect 4 that is negatively correlated: when the value of Abs_RTT_VolStep_630KB is higher than 87.909 ms, the throughput is going to be lower. This means that the channel capacity is the limiting factor causing buffering, and we have a negatively correlated relation in Aspect 4. Now, if we look at the right side of Fig. 10, STrees describes low RTT and high packet loss as well, deriving that result from the same performance indicator plus radio.

TABLE II: TTrees performance with different datasets

Dataset                        Classifier   Accuracy  Precision  Recall  F1
Dataset #1: HTTP file DL       C1           82.5%     NA         NA      NA
(Nokia)                        C2 (MI)      92.0%     NA         NA      NA
                               C2 (JMI)     91.0%     NA         NA      NA
                               C2 (Mscd)    90.0%     NA         NA      NA
Dataset #2: HTTP livepage DL   C1           82.6%     80.0%      81.5%   80.0%
(Nokia)                        C2 (MI)      -         -          -       -
                               C2 (JMI)     -         -          -       -
                               C2 (Mscd)    88.0%     85.0%      86.0%   84.0%
Dataset #3: HTTP file DL       C1           92.0%     90.0%      91.0%   89.0%
(MONROE)                       C2 (MI)      95.0%     92.6%      94.4%   92.5%
                               C2 (JMI)     95.0%     94.5%      94.5%   92.7%
                               C2 (Mscd)    95.0%     93.0%      94.4%   93.5%

Furthermore, if we look at the whole trees, they both convey similar results. In practice, TTrees and STrees are comparable and show similar classifications, taking into account the differences in the selected aspects. For instance, the root node is the same, Abs_RTT_avg less than or equal to 122.432 ms, but the tree leaves cannot be identical. Both results are usable by an expert to identify manageable sets of similar tests for troubleshooting. Additionally, the tree rules allow assigning each group to the right troubleshooting department.
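Since the C2 classifiers are plain decision trees, rule sets like those in Fig. 10 can be reproduced with a shallow scikit-learn tree whose thresholds are directly readable. A sketch under assumptions: the feature columns and the label column anomaly_cause are hypothetical placeholders standing in for the selected indicators and the cluster-combination label.

```python
# Sketch of a C2-style interpretable classifier: a shallow decision tree
# over selected indicators, whose exported rules resemble Fig. 10.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.read_csv("kpi_measurements.csv")          # hypothetical, as before
features = ["Abs_RTT_avg", "Abs_CWIN_VolStep_4320KB",
            "sack_pkts_sent_a2b", "Start.RSRP.dBm"]
X, y = df[features], df["anomaly_cause"]          # y: assumed aspect-combination label

# Keep the tree shallow so every leaf is a short, human-readable rule
# that can be handed to the right troubleshooting team.
tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=20,
                              random_state=0).fit(X, y)
print(export_text(tree, feature_names=features))  # prints threshold rules
```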

The overall accuracy score of C2 for both approaches is 90%, which means that C2 can identify 90% of non-anomalies and anomalies alongside their root cause. With the above simple example, we have defined an automated anomaly detection system that is functional to improving the quality of the networking service. Indeed, TTrees can self-identify network performance issues, so it can be used to alert the right support personnel, either the TCP or the radio optimization team in the presented example, which will receive as input the rules from Fig. 10 that raised the alarm. Acknowledging that other classification approaches might yield higher accuracy, we remark that this would occur at the expense of interpretability.
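As an illustration of this alerting step, the toy dispatcher below maps the indicator family of the rule that fired to a support team. The family and team names are invented for the example and are not part of TTrees.

```python
# Toy dispatch sketch: route an alarm to a team based on the family of
# the decision-tree rule that triggered it. Names are illustrative only.
FAMILY_TO_TEAM = {
    "Radio": "radio-optimization",
    "TCPWINDOW": "tcp-optimization",
    "RTT": "transport",
    "PacketAnomaly": "core-network",
}

def route_alarm(rule_family: str, rule: str) -> str:
    """Return the team an anomaly alarm should be sent to."""
    team = FAMILY_TO_TEAM.get(rule_family, "noc-triage")  # fallback queue
    print(f"ALERT -> {team}: triggered by rule '{rule}'")
    return team

route_alarm("Radio", "Start.RSRP.dBm <= -85.5")
```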

D. Performance Evaluation

TTrees has been executed on the three different datasets. The target KPI for Dataset #2 is Average Session Duration, discretized into six classes, and for Dataset #3 the KPI is Throughput, discretized into eight classes. Since the new datasets are large enough for TTrees, we split them into training and test subsets, with 10% and 20% testing data, respectively, while applying cross validation 30 times. Tests are done on classifiers C1 and C2 to get the metrics shown in Table II. The number of anomalous scenarios is 1152 and 6902, with 25 and 36 most relevant performance indicators, respectively. TTrees with Mscd was able to identify similar network aspects in Datasets #2 and #3 as in Dataset #1: radio, TCP, and RTT. Unfortunately, the selection of the most relevant indicators did not work well using MI and JMI for Dataset #2 (HTTP livepage DL), since all the pairs of indicators selected were redundant, and TTrees could not build C2 (hence the entries "-" in Table II). This was not surprising to our expert, since the anomalous scenarios of Dataset #2 are very hard to classify, as they contain downloads of webpages of multiple sizes. The fact that Mscd found meaningful network aspects is indeed impressive. For Datasets #1 and #3, all three C2 classifiers show similar classification performance (see Table II). However, the use of JMI and MI did not produce aspects easily identifiable by an expert. In summary, using Mscd leads to better performance than the other two metrics, reaching good accuracy scores while producing meaningful sets of rules for troubleshooting.
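A minimal sketch of this evaluation protocol, assuming the same hypothetical DataFrame and label column as in the earlier sketches: 30 repeated random train/test splits via scikit-learn's ShuffleSplit, reporting accuracy together with macro-averaged precision, recall, and F1.

```python
# Sketch of the evaluation protocol: 30 random train/test splits with
# 10% (or 20%) of the data held out, metrics averaged over repetitions.
import pandas as pd
from sklearn.model_selection import ShuffleSplit, cross_validate
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("kpi_measurements.csv")             # hypothetical, as before
X, y = df.drop(columns=["anomaly_cause"]), df["anomaly_cause"]

cv = ShuffleSplit(n_splits=30, test_size=0.10, random_state=0)  # 0.20 for Dataset #3
scores = cross_validate(
    DecisionTreeClassifier(max_depth=4, random_state=0), X, y, cv=cv,
    scoring=["accuracy", "precision_macro", "recall_macro", "f1_macro"])

for m in ["accuracy", "precision_macro", "recall_macro", "f1_macro"]:
    vals = scores["test_" + m]
    print(f"{m}: {vals.mean():.3f} +/- {vals.std():.3f}")
```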

VI. CONCLUSIONS

In this paper, we have shown that it is possible to use interpretable ML to automatically identify the networking aspects that cause anomalous behaviors. Such anomalies cannot be immediately and directly explained by observing the network performance indicators: their large number, even in the presence of a limited number of samples per indicator, makes it impossible to manually inspect and correlate them. We have purposely developed a radically novel methodology and implemented it in Python, in TTrees. Specifically, we have leveraged the key observation that if a network behavior cannot be modeled (or learned by ML), it is a symptom of a network anomaly. We therefore easily identify performance anomalies, after which we are able to use unsupervised ML and Kolmogorov complexity-inspired tools to smartly search for the aspects that are most relevant to explain where in the network protocols the anomaly is rooted. In particular, we have introduced a novel metric, miscoding, for the evaluation of the redundancy and relevance of performance indicators. The final outcome of our proposed methodology is a set of easy-to-interpret classification rules that, in case of a performance anomaly, allow to automatically alert the appropriate departments for corrective actions. To do so, TTrees only needs small volumes of samples for the performance indicators. Indeed, TTrees allows to fully automate network troubleshooting with high accuracy and fast training, as shown in this paper with the help of real data gathered from operational cellular networks in various countries.

ACKNOWLEDGMENTS

Partially supported by Regional Government of Madrid (CM) grant EdgeData-CM (P2018/TCS4499, cofunded by FSE & FEDER) and Spanish Ministry of Science and Innovation grant ECID (PID2019-109805RB-I00, cofunded by FEDER). The work of V. Mancuso was supported by the Ramon y Cajal grant (ref: RYC-2014-16285) from the Spanish Ministry of Economy and Competitiveness.

REFERENCES

[1] P. Yang, Y. Xiao, M. Xiao, and S. Li, "6G wireless communications: Vision and potential techniques," IEEE Network, vol. 33, no. 4, pp. 70–75, July 2019.

[2] J. P. Santos, R. Alheiro, L. Andrade, A. L. Valdivieso-Caraguay, L. I. Barona-Lopez, M. A. Sotelo-Monge, L. J. Garcia-Villalba, W. Jiang, H. Schotten, J. M. Alcaraz-Calero, Q. Wang, and M. J. Barros, "SELFNET framework self-healing capabilities for 5G mobile networks," Trans. Emerg. Telecommun. Technol., vol. 27, no. 9, pp. 1225–1232, Sep. 2016.

[3] A. Asghar, H. Farooq, and A. Imran, "Self-healing in emerging cellular networks: Review, challenges, and research directions," IEEE Communications Surveys & Tutorials, vol. 20, no. 3, pp. 1682–1709, Apr. 2018.

[4] O. Loyola-Gonzalez, "Black-box vs. white-box: Understanding their advantages and weaknesses from a practical point of view," IEEE Access, vol. 7, pp. 154096–154113, Oct. 2019.

[5] A. Holzinger, M. Plass, K. Holzinger, G. Crisan, C. Pintea, and V. Palade, "A glass-box interactive machine learning approach for solving NP-hard problems with the human-in-the-loop," arXiv preprint arXiv:1708.01104, 2017.

[6] F. K. Dosilovic, M. Brcic, and N. Hlupic, "Explainable artificial intelligence: A survey," in MIPRO, 2018.

[7] T. Ogata, A. Takeuchi, S. Fukuda, T. Yamada, T. Ochi, K. Inoue, and J. Ota, "Characteristics of skilled and unskilled system engineers in troubleshooting for network systems," IEEE Access, vol. 8, pp. 80779–80791, Apr. 2020.

[8] N. Bostrom and E. Yudkowsky, "The ethics of artificial intelligence," The Cambridge Handbook of Artificial Intelligence, vol. 1, pp. 316–334, Jul. 2014.

[9] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, Oct. 2011.

[10] O. Alay, A. Lutu, M. Peon-Quiros, V. Mancuso, T. Hirsch, K. Evensen, A. Hansen, S. Alfredsson, J. Karlsson, A. Brunstrom, A. Safari Khatouni, M. Mellia, and M. A. Marsan, "Experience: An open platform for experimentation with commercial mobile broadband networks," in ACM MobiCom, 2017.

[11] R. Barco, L. Nielsen, R. Guerrero, G. Hylander, and S. Patel, "Automated troubleshooting of a mobile communication network using Bayesian networks," in MWCN, 2002.

[12] R. M. Khanafer, B. Solana, J. Triola, R. Barco, L. Moltsen, Z. Altman, and P. Lazaro, "Automated diagnosis for UMTS networks using Bayesian network approach," IEEE Transactions on Vehicular Technology, vol. 57, no. 4, pp. 2451–2461, July 2008.

[13] S. Rezaei, H. Radmanesh, P. Alavizadeh, H. Nikoofar, and F. Lahouti, "Automatic fault detection and diagnosis in cellular networks using operations support systems data," in IEEE/IFIP NOMS, 2016.

[14] E. J. Khatib, R. Barco, A. Gomez-Andrades, and I. Serrano, "Diagnosis based on genetic fuzzy algorithms for LTE self-healing," IEEE Trans. on Vehicular Technology, vol. 65, no. 3, pp. 1639–1651, March 2016.

[15] G. Ciocarlie, U. Lindqvist, K. Nitz, S. Novaczki, and H. Sanneck, "On the feasibility of deploying cell anomaly detection in operational cellular networks," in IEEE/IFIP NOMS, 2014.

[16] Q. Liao and S. Stanczak, "Network state awareness and proactive anomaly detection in self-organizing networks," in IEEE Globecom, 2015.

[17] P. D. Grunwald, The Minimum Description Length Principle (Adaptive Computation and Machine Learning). The MIT Press, 2007.

[18] C. S. Wallace and D. L. Dowe, "Minimum Message Length and Kolmogorov Complexity," The Computer Journal, vol. 42, no. 4, pp. 270–283, Jan. 1999.

[19] Y. Yang and G. I. Webb, "Discretization for naive-Bayes learning: Managing discretization bias and variance," Machine Learning, vol. 74, no. 1, pp. 39–74, Jan. 2009.

[20] S. Garcia, J. Luengo, J. A. Saez, V. Lopez, and F. Herrera, "A survey of discretization techniques: Taxonomy and empirical analysis in supervised learning," IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 4, pp. 734–750, Apr. 2013.

[21] D. V. Carvalho, E. M. Pereira, and J. S. Cardoso, "Machine learning interpretability: A survey on methods and metrics," Electronics, vol. 8, no. 8, p. 832, Jul. 2019.

[22] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. Kluwer Academic Publishers, New York, 1984.

[23] T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference and Prediction. Springer, 2009.

[24] W.-Y. Loh, "Fifty years of classification and regression trees," International Statistical Review, vol. 82, no. 3, pp. 329–348, Dec. 2014.

[25] H. Peng, F. Long, and C. Ding, "Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 8, pp. 1226–1238, Aug. 2005.

[26] G. Brown, A. Pocock, M.-J. Zhao, and M. Lujan, "Conditional likelihood maximisation: A unifying framework for information theoretic feature selection," Journal of Mach. Learn. Res., vol. 13, pp. 27–66, Jan. 2012.

[27] R. Cilibrasi and P. M. Vitanyi, "Clustering by compression," IEEE Trans. on Information Theory, vol. 51, no. 4, pp. 1523–1545, Apr. 2005.

[28] D. C. Hoaglin, F. Mosteller, and J. W. Tukey (Editors), Understanding Robust and Exploratory Data Analysis, 1st ed. Wiley-Interscience, 2000.

[29] D. Arthur and S. Vassilvitskii, "K-means++: The advantages of careful seeding," in ACM SODA, 2007.

[30] P. M. B. Vitanyi, F. J. Balbach, R. L. Cilibrasi, and M. Li, "Normalized information distance," arXiv preprint arXiv:0809.2553, 2008.

[31] V. Mancuso, M. Peon Quiros, C. Midoglu, M. Moulay, V. Comite, A. Lutu, O. Alay, S. Alfredsson, M. Rajiullah, A. Brunstrom, M. Mellia, A. Safari Khatouni, and T. Hirsch, "Results from running an experiment as a service platform for mobile broadband networks in Europe," Computer Communications, vol. 133, pp. 89–101, Jan. 2019.

[31] V. Mancuso, M. Peon Quiros, C. Midoglu, M. Moulay, V. Comite,A. Lutu, O. Alay, S. Alfredsson, M. Rajiullah, A. Brunstrom, M. Mellia,A. Safari Khatouni, and T. Hirsch, “Results from running an experimentas a service platform for mobile broadband networks in Europe,”Computer Communications, vol. 133, pp. 89–101, Jan. 2019.