
Efficient Fault Diagnosis Using Incremental Alarm Correlation and Active Investigation for Internet and Overlay Networks

Yongning Tang, Member, IEEE, Ehab Al-Shaer, Member, IEEE, and Raouf Boutaba, Senior Member, IEEE

Abstract—Fault localization is the core element in fault management. The Symptom-Fault map is commonly used to describe Symptom-Fault causality in fault reasoning. For Internet service networks, a well-designed monitoring system can effectively correlate the observable symptoms (i.e., alarms) with the critical network faults (e.g., link failure). However, lost and spurious symptoms can significantly degrade the performance and accuracy of a passive fault localization system. For overlay networks, due to limited underlying network accessibility, as well as overlay scalability and dynamics, it is impractical to build a static overlay Symptom-Fault map. In this paper, we first propose a novel Active Integrated fault Reasoning (AIR) framework to incrementally incorporate active investigation actions into the passive fault reasoning process, based on an extended Symptom-Fault-Action (SFA) model. Second, we propose an Overlay Network Profile (ONP) to facilitate the dynamic creation of an Overlay Symptom-Fault-Action (O-SFA) model, such that the AIR framework can be applied seamlessly to overlay networks (called O-AIR). The corresponding fault reasoning and action selection algorithms are elaborated. Extensive simulations and Internet experiments show that AIR and O-AIR can significantly improve both the accuracy and the performance of fault reasoning for Internet and overlay service networks, especially when the ratio of lost and spurious symptoms is high.

Index Terms—fault localization, symptom-fault map, fault reasoning, overlay networks.

I. INTRODUCTION

Fault localization is the core element of a fault management system, because it identifies the fault causes that can best explain the observed network symptoms (i.e., alarms). Most fault reasoning algorithms use a bipartite directed acyclic graph to describe the Symptom-Fault correlation, which represents the causal relationship between each fault fi and the set of its observed symptoms Sfi ([9]). The Symptom-Fault causality graph provides a vector of correlation likelihood measures p(si|fi) that binds a fault fi to the set of its symptoms Sfi.

For an Internet Service Provider (e.g., AT&T), a well-designed network monitoring system can be deployed to continuously monitor the critical network elements (e.g., routers) for certain network behaviors (e.g., link connectivity) and raise alarms in the event of a failure. The corresponding

Y. Tang is with the School of Information Technology, Illinois State University, Normal, IL, USA. Email: [email protected].

E. Al-Shaer is with the School of Computer Science, Telecommunications and Information Systems, DePaul University, Chicago, IL, USA. Email: [email protected].

R. Boutaba is with the David R. Cheriton School of Computer Science, University of Waterloo, Waterloo, Canada. Email: [email protected].

Symptom-Fault map can be used to reflect this intrinsic relationship between the symptoms (i.e., alarms) and the corresponding faults (e.g., link failure). Two approaches are commonly used in fault reasoning and localization: passive diagnosis ([5], [8], [9], [12]) and active investigation ([6], [10], [11], [15]). In the passive approach, all symptoms are passively collected and then processed to infer the root faults. In the active approach, faults are detected by conducting a set of investigation actions. The passive approach causes less network intrusiveness; however, it may take a long time to discover the root faults, particularly if the symptom loss ratio (SLR) is high. Here, SLR = (S − SO)/S, where SO is the number of observed symptoms and S is the total number of generated symptoms. Although the active investigation approach can identify faults more quickly, active investigation (e.g., probing) might cause significant overhead, particularly in large-scale networks.

For an Overlay Service Provider (OSP, e.g., Akamai), critical information (e.g., network fault statistics) from ISPs is, for commercial reasons, not shared with the OSP. Moreover, due to the dynamics and scalability of overlay networks, it is impractical for an OSP to construct a static Symptom-Fault map to facilitate fault localization in overlay networks.

In this paper, we address the fault localization problem for two common but different kinds of networks: Internet and overlay service networks. First, we propose a novel fault localization technique that integrates the advantages of both passive and active fault reasoning into one framework, called Active Integrated fault Reasoning (AIR). The goal of AIR is to balance the use of active and passive measurements. Active investigation actions are selected only if passive reasoning is not sufficient; if the passively observed symptoms can already explain most of the faults, the use of active measurements is automatically reduced. Second, to tackle the new challenges in overlay fault localization (e.g., inaccessible underlying network information and overlay network dynamics), we propose to build an Overlay Network Profile (ONP). Based on the ONP, we introduce a dynamic Overlay Symptom-Fault-Action (O-SFA) model for overlay networks, incorporating the AIR framework into the overlay fault localization system (called O-AIR). Our approach significantly improves the performance of fault localization while minimizing the intrusiveness of active fault reasoning.

The paper is organized as follows. After discussing the related work in Section II, we introduce our research motivation


[Fig. 1 depicts a bipartite Symptom-Fault-Action graph: actions a1-a4 connect to the symptoms s1-s4 they can verify, and symptoms connect to faults f1-f3 with causal probabilities (e.g., p(s1|f1) = 0.9, p(s1|f2) = 0.3); the faults carry prior probabilities (.01, .01, .02).]

Fig. 1. Symptom-Fault-Action Model

and the problem formalization in Section III. In Section IV, we describe the components and algorithms of AIR. In Section V, we introduce the Overlay Network Profile and the dynamic Overlay Symptom-Fault-Action model, and then discuss the overlay fault localization framework O-AIR. In Section VI, we present our simulations and Internet experiments evaluating the performance and accuracy of AIR and O-AIR for both Internet and overlay service networks. In Section VII, we conclude the paper and discuss our future work.

II. RELATED WORK

A significant amount of work has been done in the area of fault localization, for both Internet and overlay networks. Accordingly, we classify the corresponding solutions into Administrator-Level and User-Level fault localization.

A. Administrator-Level Fault Localization.

For a single administratively controlled network, such as an ISP network, the Symptom-Fault causality model is commonly used to infer the root faults based on observations of network disorders. In the following, we classify this type of related work into two categories:

Passive Approach: Passive fault management techniques typically depend on monitoring agents to detect and report network abnormalities using alarms or symptom events. These events are then analyzed and correlated in order to find the root faults. Various event correlation models have been proposed, including rule-based analysis systems [17], case-based diagnosis systems, and model-based systems [18]. Different techniques have also been introduced to improve the performance, accuracy, and resilience of fault localization. In [12], a model-based event correlation engine is designed for multi-layer fault diagnosis. In [5], the coding approach is applied to a deterministic model to reduce reasoning time and improve system resilience. A novel incremental event-driven fault reasoning technique is presented in [8] and [9] to improve the robustness of a fault localization system by analyzing lost, positive, and spurious symptoms.

The techniques above were developed based on passively received symptoms. If the symptoms are collected correctly, the fault reasoning results can be accurate. However, in real systems, symptom loss and spurious symptoms (called observation noise) are unavoidable. Even with a good strategy ([9]) for dealing with observation noise, these techniques still have only limited resilience to it because of their purely passive fault analysis, which might also increase the fault detection time.

Active Approach: Recently, researchers have proposed several active fault localization approaches. In [11], an active probing fault localization system is introduced, in which pre-planned active probes are associated with system status using a dependency matrix. An on-line action selection algorithm is studied in [10] to optimize action selection. In [15], a fault detection and resolution system is proposed for large distributed transaction processing systems. In [3], a Computing Utility profit maximization problem is formulated as a Markov Decision Process (MDP), which provides a useful model for formalizing comprehensive action selection problems.

The active approach is more efficient in locating faults in a timely fashion and more resilient to observation noise. However, it lacks a scalable technique that can deal with multiple simultaneous faults. It also cannot easily isolate intermittent network faults and performance-related faults, because it depends solely on the active probing model. In this approach, the number of required probes might grow exponentially with the number of possible faults ([10]).

Both the passive and the active approach have their own strengths and limitations. Thus, incorporating active investigation actions into the passive fault reasoning approach is an appealing framework. AIR combines the strengths of both approaches and overcomes their limitations by optimizing the fault reasoning result and the action selection process.

B. User-Level Fault Localization.

Recently, more research has focused on user-level network measurement and fault diagnosis tools and approaches, particularly for overlay networks. In the following, we classify them into diagnosis tools and diagnosis frameworks.

Diagnosis Tools: Many end-to-end traffic measurement tools have been proposed for monitoring packet loss and other path properties for problem diagnosis, such as Sting, Cing, and Tulip. These tools are good at diagnosing a specific network property but are not adequate for general problem diagnosis in overlay networks. Most of these techniques cause extensive intrusiveness due to active probing. Moreover, they may not discover intermittent problems.

Diagnosis Frameworks: One of the most interesting recent works is the tomography-based approach, which estimates network performance parameters based on traffic measurements on a limited subset of overlay nodes [21], [22]. However, like the previous tools, it is still a purely active approach that usually requires extensive probing to achieve accurate results, whether or not problems actually exist. On the other hand, PlanetSeer ([23]), taking a multiple-vantage-point approach, locates Internet faults by selectively and periodically invoking "traceroute" from multiple vantage points. Its measurement model is manually managed and only matches the data flow directions of the application domain.


To the best of our knowledge, our proposed O-AIR is the first comprehensive user-level probabilistic overlay fault diagnosis framework that integrates active monitoring with passive fault reasoning based on a dynamically generated Overlay Symptom-Fault-Action model.

III. MOTIVATION AND PROBLEM FORMALIZATION

For traditional Internet service networks, active fault management does not scale well when the number of nodes or faults in the network grows significantly. In fact, some faults, such as intermittent reachability problems, may not even be identified if only active fault management is used. Such faults can, however, easily be reported by passive fault management systems, because agents are configured to report abnormal system conditions, or symptoms, such as a high average packet drop ratio. On the other hand, symptoms can be lost due to noisy or unreliable communication channels, or the observation might be corrupted by spurious (untrue) symptoms generated by malfunctioning agents or devices. This significantly reduces the accuracy and performance of passive fault localization. Only the integration of active and passive reasoning can provide efficient fault localization solutions.

To incorporate actions into a traditional Symptom-Fault model, we propose an extended Symptom-Fault-Action (SFA) model, as shown in Fig. 1. In our model, actions are properly selected probes or test transactions used to detect or verify the existence of observable symptoms. Actions can simply be commonly used network utilities, like ping and traceroute, or part of a proprietary fault management system (e.g., [6]). We assume that symptoms are verifiable: if a symptom ever occurred, we can verify its existence by executing some investigation actions (e.g., probing) or by checking the system status through, for example, system logs.

In this paper, we use F = {f1, f2, ..., fn} to denote the fault set, and S = {s1, s2, ..., sm} to denote the symptom set that can be caused by one or multiple faults in F. The causality matrix P_{F×S} = {p(si|fj)} defines the causal certainty between fault fj (fj ∈ F) and symptom si (si ∈ S). If p(si|fj) = 0 or 1 for all (i, j), we call the causality model deterministic; otherwise, we call it probabilistic. We also use A = {a1, ..., ak} to denote the list of actions that can be used to verify symptom existence. We describe the relation between actions and symptoms using an Action Codebook, represented as a bipartite graph as shown in Fig. 1. For example, symptom s1 can be verified using action a1 or a2. The Action Codebook can be defined by network managers based on the symptom types, the network topology, and the available fault diagnostic tools. The extended Symptom-Fault-Action (SFA) graph is a 5-tuple (S, F, A, E1, E2), where the fault set F, symptom set S, and action set A are three independent vertex sets. Every edge in E1 connects a vertex in S to a vertex in F to indicate a causality relationship between symptoms and faults. Every edge in E2 connects a vertex in A to a vertex in S to represent the Action Codebook. For convenience, in Fig. 2,

[Fig. 3 is a flowchart of the AIR loop: Fault Reasoning (FR) produces hypotheses; Fidelity Evaluation (FE) checks "Fidelity satisfied?"; if not, Action Selection (AS) asks "Symptoms verified?" and loops back to FR; the process ends in the conclusion state once a highly credible hypothesis h is found.]

Fig. 3. Active Action Integrated Fault Reasoning

we introduce the notations used in our discussion of Active Integrated Fault Reasoning. The basic Symptom-Fault-Action model can be described as follows:

• For every action, associate an action vertex ai, ai ∈ A;
• For every symptom, associate a symptom vertex si, si ∈ S;
• For every fault, associate a fault vertex fi, fi ∈ F;
• For every fault fi, associate an edge to each si caused by this fault, with a weight equal to p(si|fi);
• For every action ai, associate an edge, of weight equal to the action cost, to each symptom verifiable by this action.
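As an illustration, the SFA 5-tuple can be held in two edge maps. This is a hypothetical sketch, not the paper's implementation: the probabilities echo Fig. 1 (e.g., p(s1|f1) = 0.9, p(s1|f2) = 0.3), while the remaining edges and the action costs are assumed for the example.

```python
# E1: causality edges (symptom, fault) -> p(s|f), i.e., the causality matrix P_{FxS}.
# Values partly taken from Fig. 1; the rest are illustrative assumptions.
E1 = {
    ("s1", "f1"): 0.9, ("s1", "f2"): 0.3,
    ("s2", "f1"): 0.3, ("s2", "f2"): 0.9,
    ("s3", "f2"): 0.3, ("s3", "f3"): 0.9,
    ("s4", "f3"): 0.3,
}

# E2: Action Codebook edges (action, symptom) -> action cost (assumed costs).
E2 = {
    ("a1", "s1"): 1.0, ("a2", "s1"): 2.0,
    ("a2", "s2"): 1.0, ("a3", "s3"): 1.0,
    ("a4", "s4"): 1.0,
}

def faults_of(symptom):
    """F_s: all faults that might cause the given symptom."""
    return {f for (s, f) in E1 if s == symptom}

def symptoms_of(fault):
    """S_f: all symptoms the given fault can cause."""
    return {s for (s, f) in E1 if f == fault}

def actions_for(symptom):
    """Actions from the codebook that can verify the symptom."""
    return {a for (a, s) in E2 if s == symptom}

print(sorted(faults_of("s1")))    # ['f1', 'f2']
print(sorted(actions_for("s1")))  # ['a1', 'a2']
```

Storing both edge sets as dictionaries keyed by vertex pairs keeps the three vertex sets independent, as in the 5-tuple definition, while making the F_s and S_f lookups used later by the reasoning algorithm one-line set comprehensions.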

Performance and accuracy are the two most important factors for evaluating fault localization techniques. Performance is measured by the fault detection time T, which is the time between receiving the fault symptoms and identifying the root faults. The fault diagnostic accuracy depends on two factors: (1) the detection ratio α, the ratio of the number of correctly detected root faults to the number of actually occurred faults Fh, where Fd is the total detected fault set; formally, α = |Fd ∩ Fh| / |Fh|; and (2) the false positive ratio β, the ratio of the number of falsely reported faults to the total number of detected faults; formally, β = |Fd − Fd ∩ Fh| / |Fd| [9]. Therefore, the goal of any fault management system is to increase α and reduce β in order to achieve highly accurate fault reasoning results.
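The two accuracy metrics reduce to simple set arithmetic. A small worked example (the fault sets below are made up for illustration, not taken from the paper):

```python
# Illustrative computation of the detection ratio and false positive ratio.
Fh = {"f1", "f3"}   # faults that actually occurred
Fd = {"f1", "f2"}   # faults reported by the localization system

alpha = len(Fd & Fh) / len(Fh)   # detection ratio: |Fd ∩ Fh| / |Fh|
beta  = len(Fd - Fh) / len(Fd)   # false positive ratio (Fd − Fd∩Fh equals Fd − Fh)

print(alpha)  # 0.5 -> only f1 of the two real faults was detected
print(beta)   # 0.5 -> f2 was falsely reported
```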

The task of fault reasoning is to search for root faults in F based on the observed symptoms SO. Our objective is to improve fault reasoning by minimizing the detection time T and the false positive ratio β, and by maximizing the detection ratio α.

In order to develop this system, we have to address the following three problems: (1) given the Fault-Symptom correlation matrix and the set of observed symptoms SO, construct the set of most plausible hypotheses Φ = {h1, h2, ..., hp}, hi ⊆ F, that can explain the currently observed symptoms; (2) given a set of possible hypotheses, find the most credible hypothesis h that gives the best explanation of the currently observed symptoms; (3) if the selected hypothesis does not satisfy the fidelity requirement, then


Notation: Definition
Sfi: the set of all symptoms caused by fault fi
Fsi: the set of all faults that might cause symptom si
SO: the set of all symptoms observed so far
SOi: the set of observed symptoms caused by fault fi
SUi: the set of not-yet-observed (lost) symptoms caused by fault fi
hi: a set of faults constituting a possible hypothesis that can explain SO
Φ: the set of all different fault hypotheses hi that can explain SO
SN: the set of correlated but not-yet-observed symptoms associated with any fault in a hypothesis
SV: the subset of SN whose symptoms are confirmed to exist
SU: the subset of SN whose symptoms are confirmed not to exist

Fig. 2. Active Integrated Fault Reasoning Notation

use the unobserved symptoms SN to select minimum-cost actions that search for an acceptable hypothesis.

Compared to Internet service networks, overlay networks have emerged as a powerful and flexible platform for developing new disruptive network applications. We believe the following new characteristics, challenges, and service objectives suggest that overlay fault localization has to adopt a new approach:

• Inaccessible underlying network information and incomplete network status observation: The overlay network infrastructure is owned and controlled by Internet service providers (ISPs), while overlay services are provisioned, operated, and monitored by overlay service providers (OSPs). Overlay networks are formed by coordinated overlay nodes on top of opaque underlying networks. It is infeasible to know the prior fault probabilities of the underlying components and to encode a probabilistic relationship between underlying components and user-level observations. An overlay fault diagnosis technique must therefore be developed from incomplete and insufficient user observations, without relying on an underlying network fault probabilistic model as in [7], [4], [9].

• Multi-layer complexity and dynamic Symptom-Fault causality: The flexibility and dynamics of overlay nodes and links make the interaction (and correlation) between overlay and underlying networks unpredictable. In overlay networks, observed symptoms are usually not designed and collected for monitoring specific faults (e.g., malfunctioning network router interfaces). The Symptom-Fault causality relationship is dynamic and unpredictable in overlay networks.

• Fault reasoning goal and granularity: The goal of traditional fault reasoning is to accurately locate the faulty components and fix them. In overlay networks, however, it is often more effective for overlay applications to avoid detected faulty components instead of fixing them. Thus, for overlay fault reasoning, coarse-grained reasoning is acceptable and may even be preferable.

In the following, we first discuss the solution to the above three issues for fault localization in Internet service networks. Then, we propose a solution addressing the new challenges in fault localization for overlay service networks.

IV. ACTIVE INTEGRATED FAULT REASONING

The Active Integrated Fault Reasoning (AIR) process (Fig. 3) includes three functional modules: Fault Reasoning (FR), Fidelity Evaluation (FE), and Action Selection (AS). The Fault Reasoning module takes passively observed symptoms SO as input and returns the fault hypothesis set Φ as output. The hypothesis set Φ might include several hypotheses (h1, h2, ..., hn), each of which contains a set of faults that explains all symptoms observed up to that point. Then Φ is sent to the Fidelity Evaluation module to check whether any hypothesis hi (hi ∈ Φ) is satisfactory. If most of the correlated symptoms necessary to explain the fault hypothesis hi have been observed (i.e., high fidelity), the fault reasoning process terminates. Otherwise, a list of unobserved symptoms SN that would help explain the highest-fidelity hypothesis hi is sent to the Action Selection module to determine which of these symptoms have occurred; the fidelity value of hypothesis hi is adjusted accordingly. The conducted actions return a test result with a set of existing symptoms SV and non-existing symptoms SU, and the fidelity value might increase or decrease based on these action results. If the newly calculated fidelity is satisfactory, the reasoning process terminates; otherwise, SO and SU are sent as new input to the Fault Reasoning module to create a new hypothesis. This process is repeated until a hypothesis with high fidelity is found. Fidelity calculation is explained later in this section. In the following subsections, we describe the three modules in detail and then discuss the complete Active Integrated Fault Reasoning algorithm.
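The FR → FE → AS loop described above can be sketched as a small control loop. This is only a structural sketch under stated assumptions: the three modules are passed in as placeholder callables (their real logic is defined later in the paper), and the termination guard for the case where actions yield no new information is our addition.

```python
def air(observed, fidelity_threshold, fault_reasoning, fidelity, select_actions):
    """Structural sketch of the AIR loop: FR -> FE -> AS, repeated until a
    high-fidelity hypothesis is found. The module callables are assumptions."""
    SO = set(observed)   # observed symptoms so far
    SU = set()           # symptoms confirmed non-existent by actions
    while True:
        hypotheses = fault_reasoning(SO, SU)        # FR module: build Phi
        h = max(hypotheses, key=fidelity)           # pick the best hypothesis
        if fidelity(h) >= fidelity_threshold:       # FE module: credible enough?
            return h                                # conclusion: credible h found
        SV, new_SU = select_actions(h, SO)          # AS module: verify symptoms
        SO |= SV                                    # verified-existing symptoms
        SU |= new_SU                                # verified-non-existing symptoms
        if not SV and not new_SU:                   # no new evidence: stop (guard
            return h                                # added for the sketch)
```

A trivial usage example with stub modules: if the first hypothesis already meets the threshold, the loop returns immediately without invoking any actions.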

A. Heuristic Algorithm for Fault Reasoning

In the Fault Reasoning module, we use a contribution function, C(fi), as the criterion for finding the faults that contribute most to the observed symptoms. In the probabilistic model, symptom si can be caused by any fault fi (fi ∈ Fsi) with probability p(si|fi) ∈ (0, 1]. We assume that the Symptom-Fault correlation model is sufficiently complete that undocumented faults and symptoms can be neglected (i.e., their prior probability is very low). Thus, we can also assume that symptom si will not occur if none of the faults in Fsi has occurred; in other words, if si occurred, at least one fi (fi ∈ Fsi) must have occurred. However, the conditional probability p(si|fi) by itself may not truly reflect the likelihood that fault fi has occurred given the observation of symptom si. For example, in Fig. 1, there are three possible scenarios that can result

Page 5: Efï¬cient Fault Diagnosis Using Incremental Alarm Correlation and

from observing s1: f1 happened, f2 happened, or both happened. Based on the common heuristic assumption that the probability of multiple faults happening simultaneously is low [9], one of the faults (f1 or f2) should explain the occurrence of s1. In order to measure the contribution of each fault fi to the creation of si, we normalize the conditional probability p(si|fi) into the normalized conditional probability τ(si|fi), which reflects the relative contribution of each fault fi to the observation of si.

τ(si|fi) = p(si|fi) / Σ_{fj ∈ Fsi} p(si|fj)    (1)

With τ(si|fi), we can compute the normalized posterior probability µ(fi|si) as follows:

µ(fi|si) = τ(si|fi) p(fi) / Σ_{fj ∈ Fsi} τ(si|fj) p(fj)    (2)

µ(fi|si) is the relative probability that fi has happened given the observation of si. For example, in Fig. 1, assuming all faults have the same prior probability, µ(f1|s1) = 0.9/(0.9 + 0.3) = 0.75 and µ(f2|s1) = 0.3/(0.9 + 0.3) = 0.25. The following contribution function C(fi) aggregates the contribution factors µ(fi|si), si ∈ SOi, given the observation SOi, and decides which fi is the best candidate, i.e., which has the maximum contribution value C(fi) with respect to the currently unexplained symptoms.
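The two normalization steps, Eqs. (1) and (2), can be reproduced directly. The sketch below uses the Fig. 1 example weights p(s1|f1) = 0.9 and p(s1|f2) = 0.3 with equal priors, as in the worked example; the dictionary layout is an assumption of the sketch.

```python
# Causality entries p(s|f) for the Fig. 1 example, and assumed equal priors p(f).
p = {("s1", "f1"): 0.9, ("s1", "f2"): 0.3}
prior = {"f1": 0.5, "f2": 0.5}

def tau(s, f):
    """Normalized conditional probability τ(s|f), Eq. (1)."""
    total = sum(p[(s2, f2)] for (s2, f2) in p if s2 == s)
    return p[(s, f)] / total

def mu(f, s):
    """Normalized posterior probability µ(f|s), Eq. (2)."""
    total = sum(tau(s, f2) * prior[f2] for (s2, f2) in p if s2 == s)
    return tau(s, f) * prior[f] / total

print(round(mu("f1", "s1"), 6))  # 0.75, matching µ(f1|s1) = 0.9/(0.9 + 0.3)
print(round(mu("f2", "s1"), 6))  # 0.25
```

With equal priors, µ collapses to τ, which is exactly why the worked example reduces to 0.9/(0.9 + 0.3); unequal priors would shift the posterior toward the historically more frequent fault.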

C(fi) = Σ_{si ∈ SOi} µ(fi|si) / Σ_{si ∈ Sfi} µ(fi|si)    (3)

Therefore, fault reasoning becomes a process of searching for the fault fi with maximum C(fi). This process continues until all observed symptoms are explained. The contribution function C(fi) can be used for both the deterministic and the probabilistic model.

In the deterministic model, the higher the number of symptoms observed, the stronger the indication that the corresponding fault has occurred. Meanwhile, we should not ignore the influence of the prior fault probability p(fi), which represents long-term statistical observation. Since p(si|fj) = 0 or 1 in the deterministic model, the normalized conditional probability reflects the influence of the prior probability of fault fi. Thus, the same contribution function seamlessly combines the effect of p(fi) and the ratio |SOi|/|Sfi|.

The fault reasoning algorithm first finds the fault candidate set FC, which includes all faults that can explain at least one symptom si (si ∈ SO); it then calls the function HU() to generate and update the hypothesis set Φ until all observed symptoms SO are explained. According to the contribution C(fi) of each fault fi (fi ∈ FC), Algorithm 1 iteratively searches for the best explanation (i.e., the fault with the highest contribution) (lines 5-6) of SK, the currently observed symptoms not yet explained by the hypothesis hi (lines 4-12). Here SK = SO − ∪_{fi∈hi} SOi, and initially SK = SO. If multiple faults have the same contribution, multiple hypotheses are generated (lines 13-17). The search process (HU) runs recursively until all observed symptoms are explained (i.e., SKi = ∅) (lines 18-24). Notice that only

Algorithm 1 Hypothesis Updating Algorithm HU(h, SK, FC)
Input: hypothesis h, observed but uncovered symptom set SK, fault candidate set FC
Output: fault hypothesis set Φ

1:  cmax = 0
2:  for all fi ∈ FC do
3:    if C(fi) > cmax then
4:      cmax ← C(fi)
5:      FS ← ∅
6:      FS ← FS ∪ {fi}
7:    else
8:      if C(fi) = cmax then
9:        FS ← FS ∪ {fi}
10:     end if
11:   end if
12: end for
13: for all fi ∈ FS do
14:   hi ← h ∪ {fi}
15:   SKi ← SK − SOi
16:   FCi ← FC − {fi}
17: end for
18: for all SKi do
19:   if SKi = ∅ then
20:     Φ ← Φ ∪ {hi}
21:   end if
22: end for
23: if Φ ≠ ∅ then
24:   return Φ
25: else
26:   /* no hi can explain all of SO */
27:   for all hi do
28:     HU(hi, SKi, FCi)
29:   end for
30: end if

those hypotheses with a minimum number of faults that cover all observed symptoms are included into Φ (lines 23-24).
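The hypothesis-updating recursion above can be sketched in Python (an illustrative implementation, not the authors' code; faults and symptoms are modeled as strings, and the contribution function `C` and the observed-symptom map are supplied by the caller):

```python
def hypothesis_update(h, S_K, F_C, C, observed):
    """Recursive hypothesis updating (sketch of Algorithm 1).

    h        -- current hypothesis, a frozenset of faults
    S_K      -- observed symptoms not yet explained by h
    F_C      -- remaining fault candidates
    C        -- contribution function: fault -> score
    observed -- observed(f): symptoms of fault f that were actually observed
    """
    if not F_C:
        return set()
    # Select the fault(s) with the maximum contribution (lines 1-12).
    c_max = max(C(f) for f in F_C)
    F_S = [f for f in F_C if C(f) == c_max]

    hypotheses, branches = set(), []
    for f in F_S:                              # lines 13-17
        h_i = h | {f}
        S_Ki = S_K - observed(f)
        if not S_Ki:                           # all symptoms explained
            hypotheses.add(frozenset(h_i))
        else:
            branches.append((h_i, S_Ki, F_C - {f}))
    if hypotheses:                             # minimal-size covers only
        return hypotheses
    for h_i, S_Ki, F_Ci in branches:           # recurse until covered
        hypotheses |= hypothesis_update(h_i, S_Ki, F_Ci, C, observed)
    return hypotheses
```

Because the recursion only descends when no branch covers all symptoms, the hypotheses returned first are those with the fewest faults, matching the minimality rule above.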

The above fault reasoning algorithm can be applied to both the deterministic and probabilistic models, with the same contribution function C(fi) but different conditional probabilities p(si|fi).

B. Fidelity Evaluation of Fault Hypotheses

The fault hypotheses created by the Fault Reasoning algorithm may not accurately determine the root faults because of lost or spurious symptoms. The task of the Fidelity Evaluation is to measure the credibility of the hypotheses created in the reasoning phase, given the corresponding observed symptoms. Objectively evaluating the reasoning result is crucial in fault localization systems.

We use the fidelity function FD(h) to measure the credibility of hypothesis h given the symptom observation SO. We assume that the occurrences of faults are independent.

• For the deterministic model:

FD(h) = ( Σ_{fi∈h} |SOi| / |Sfi| ) / |h|    (4)

• For the probabilistic model:

FD(h) = [ Π_{si ∈ ∪fi∈h Sfi} ( 1 − Π_{fi∈h} (1 − p(si|fi)) ) ] / [ Π_{si∈SO} ( 1 − Π_{fi∈h} (1 − p(si|fi)) ) ]    (5)


Obviously, in the deterministic model, if the hypothesis h is correct, FD(h) must equal 1 because the corresponding symptoms can be either observed or verified. In the probabilistic model, even if related symptoms are observed or verified, FD(h) of a credible hypothesis can still be less than 1 because some symptoms may not occur even when the hypothesis is correct. In either case, our fidelity algorithm takes into consideration a target fidelity threshold, FDTH, that the user can configure to accept the hypothesis. System administrators can define the threshold based on long-term observation and previous experience. If the threshold is set too high, even a correct hypothesis will be ignored; if it is set too low, a less credible hypothesis might be selected.

The fidelity evaluation function is used to evaluate each hypothesis and decide whether the result is satisfactory by comparing it to the pre-defined threshold value. If an acceptable hypothesis that meets the fidelity threshold exists, the fault localization process can terminate. Otherwise, the best available hypothesis and a non-empty set of symptoms (SN) are verified in order to reach a satisfactory hypothesis in the next iteration.
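Both fidelity functions can be sketched directly from Eqs. 4 and 5 (an illustrative Python rendering; the data structures are simplified dictionaries of sets and probabilities, not the authors' implementation):

```python
from math import prod

def fidelity_det(h, S_obs_of, S_all_of):
    """Deterministic fidelity (Eq. 4): average covered-symptom ratio.

    S_obs_of[f] -- observed symptoms of fault f; S_all_of[f] -- all its symptoms.
    """
    return sum(len(S_obs_of[f]) / len(S_all_of[f]) for f in h) / len(h)

def fidelity_prob(h, S_O, S_f, p):
    """Probabilistic fidelity (Eq. 5).

    S_O      -- observed symptoms
    S_f[f]   -- symptoms that fault f can cause
    p[(s,f)] -- conditional probability p(s|f)
    """
    def explained(s):
        # probability that at least one fault in h causes s
        return 1 - prod(1 - p.get((s, f), 0.0) for f in h)
    expected = set().union(*(S_f[f] for f in h))   # all symptoms h can cause
    num = prod(explained(s) for s in expected)
    den = prod(explained(s) for s in S_O)
    return num / den
```

In the deterministic case a fully verified hypothesis yields exactly 1; in the probabilistic case the ratio stays below 1 whenever expected symptoms were not observed.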

C. Action Selection Heuristic Algorithm

The main reason to verify the existence of symptoms rather than faults is that symptoms are noticeable/visible consequences of faults and are thus easier to track and verify. The task of Action Selection is to find the least-cost actions to verify SN (the unobserved symptoms) of the hypothesis that has the highest fidelity. As the size of SN grows very large, the process of selecting the minimal-cost actions that verify SN becomes non-trivial. The Action-Symptom correlation graph can be represented as a 3-tuple (A, S, E) graph such that A and S are two independent vertex sets representing actions and symptoms, respectively, and every edge e in E connects a vertex aj ∈ A with a vertex si ∈ S with a corresponding cost (tij) to denote that aj can verify si with cost tij = t(si, aj) > 0. If there is no association between si and aj, then tij = 0. Because a set of actions might be required to verify one symptom, we use a composite action vertex, vj, to represent this case. The composite action vertex vj associates a set of conjunctive actions with the corresponding symptom(s). However, if multiple actions are directly connected to a symptom, any of these actions can be used disjunctively to verify that symptom. To represent the relationship between the composite actions and the corresponding symptoms using a bipartite graph, we (1) set the cost of vj, t(si, vj), to the total cost of the conjunctive action set; (2) eliminate the conjunctive set associated with vj; and (3) associate vj with all symptoms that can be verified by any action in the conjunctive action set.

The goal of the Action Selection algorithm is to select the actions that cover all symptoms in SN with a minimal action cost. With the representation of the Symptom-Action bipartite graph, we can model this problem as a weighted set-covering problem. Thus, the Action Selection algorithm searches for Ai such that Ai includes the set of actions that cover all the symptoms in the Symptom-Action correlation graph with the minimum total cost. We can formally define Ai as the covering set that satisfies the following conditions: (1) ∀si ∈ SN, ∃aj ∈ Ai s.t. tij > 0; and (2) Σ_{aj∈Ai, si∈SN} tij is minimal.

The weighted set-covering problem is NP-complete. Thus, we developed a heuristic greedy set-covering approximation algorithm to solve it. The main idea of the algorithm is to first select the action (ai or vi) that has the maximum relative covering ratio, Ri = |Sai| / Σ_{sj∈Sai} tij; this action is added to the final set Af and removed from the candidate set Ac, which initially includes all actions. Here, Sai is the set of symptoms that action ai can verify, Sai ⊆ SN. Then, we remove all symptoms that are covered by the selected action from the unobserved symptom set SN. The search continues to find the next action ai (ai ∈ Ac) that has the maximum ratio Ri, until all symptoms are covered (i.e., SN is empty). Intuitively, this algorithm favors actions that have more symptom correlation or aggregation. If multiple actions have the same relative covering ratio, the action that covers more symptoms (i.e., has a larger |Sai|) is selected. If multiple actions have the same ratio Ri and the same |Sai|, then each action is considered independently to compute the final selected set for each action, and the set that has the minimum cost is selected. Finally, it is important to notice that each single action in the Af set is necessary for the fault-determination process, because each one covers unique symptoms.

D. Algorithm for Active Integrated Fault Reasoning

The main contribution of this work is to incorporate active actions into fault reasoning. Passive fault reasoning could work well if enough symptoms can be observed correctly. However, in most cases, we need to deal with interference from symptom loss and spurious symptoms, which could mislead fault localization analysis. As a result of fault reasoning, the generated hypothesis suggests a set of selected symptoms SN that are unobserved but expected to happen based on the highest-fidelity hypothesis. If the fidelity evaluation of this hypothesis is not acceptable, optimal actions are selected to verify SN. The action results will either increase the fidelity evaluation of the previous hypothesis or bring new evidence to generate a new hypothesis. By taking actions selectively, the system can evaluate fault hypotheses progressively and reach the root faults.

Algorithm 2 illustrates the complete process of the AIR technique. Initially, the system takes the observed symptoms SO as input. Fault Reasoning is used to search for the best hypothesis set Φ (Line 3). Fidelity is the key to associating passive reasoning with active investigation actions. Fidelity Evaluation is used to measure the correctness of the corresponding hypothesis h (h ∈ Φ) and produce the expected missing symptoms SN (Line 4). If the result h is satisfactory, the process terminates with the current hypothesis as output (Lines 5-6). Otherwise, AIR waits until the Initial Passive Period (IPP) has expired (Line 8) before initiating actions to collect more evidence in the form of verified symptoms SV and not-occurred symptoms SU (Line 10). The new evidence is then added to re-evaluate the previous hypothesis


Algorithm 2 Active Integrated Fault Reasoning AIR(SO)
Input: observed symptom set SO
Output: fault hypothesis h

 1: SN ← SO
 2: while SN ≠ ∅ do
 3:   Φ = FR(SO)
 4:   < h, SN > = FE(Φ)
 5:   if SN = ∅ then
 6:     return < h >
 7:   else
 8:     if IPP expired then
 9:       /* used to schedule active fault localization periodically */
10:       < SV , SU > = AS(SN)
11:     end if
12:   end if
13:   SO ← SO ∪ SV
14:   < h, SN > = FE({h})
15:   if SN = ∅ ∨ SV = ∅ then
16:     return < h >
17:   end if
18: end while

(Line 13). If the fidelity evaluation is still not satisfied, the new evidence from the previous observation is used to search for another hypothesis (Line 3), until the fidelity evaluation is satisfied. At any point, if either the fidelity evaluation does not find symptoms to verify (SN is ∅), or none of the verified symptoms have occurred (SV is ∅), the program terminates and returns the currently selected hypothesis. In either case, this is an indication that the currently selected hypothesis is credible.

V. ACTIVE OVERLAY FAULT LOCALIZATION

In contrast to an Internet Service Provider (ISP), which has full control over its own networks, an Overlay Service Provider (OSP) provides overlay services via a logical overlay network infrastructure built on top of underlying networks controlled by different ISPs. As a service provider, an OSP can monitor its overlay networks by collecting end-to-end symptoms either passively from overlay users or via end-to-end monitoring actions. It is impractical for an OSP to associate all observed overlay symptoms with the underlying network components and, further, to encode the conditional probabilistic causality relationships between them. Note, however, that from an overlay application standpoint, there are two types of faults: overlay faults (Fo) and underlay faults (Fu). In this paper, we call overlay node related problems overlay faults, and underlying network related problems underlay faults. An overlay application usually needs to take different countermeasures to tackle these different types of faults. For example, if an overlay multicast application detects that the cause of performance degradation is an overloaded overlay node, it can simply replace that faulty overlay node with another one in the same network; however, if the problem is due to underlying network components (e.g., malfunctioning network routers), the overlay application has to choose another overlay node from a different network so that it can reroute the application traffic around the faulty underlying network components. Thus, one important design objective of an overlay fault diagnosis system is to effectively distinguish overlay faults from underlay faults.

Considering the above objective as well as the challenges

Fig. 4. Overlay Multi-layer Illustration and 3-tuple Model

and requirements in overlay fault diagnosis discussed in Section III, we believe an overlay fault diagnosis system has to be designed to meet the following objectives. It should:
• diagnose faults across multiple layers of abstraction and effectively distinguish overlay faults from underlay faults;
• dynamically create the Symptom-Fault correlation;
• deal with insufficient user observations.

In the following, we first propose an approach to construct an Overlay Network Profile (ONP) to facilitate overlay fault diagnosis. Then, we introduce how to dynamically build an Overlay Symptom-Fault-Action (O-SFA) model. Finally, we discuss the process of Overlay Active Integrated fault Reasoning (O-AIR). We list in Fig. 7 the relevant notations used in the O-AIR discussion.

A. Overlay Network Profile

There are two types of networks or Autonomous Systems (ASes) in the Internet: Internet Service Provider (ISP) networks and End-User Networks (EUNs). An overlay network consists of a set of overlay nodes (No) and a set of logical overlay paths (Po). Each overlay path po_i (po_i ∈ Po) consists of one or multiple overlay links lo_i (lo_i ∈ Lo, where Lo is the set of overlay links). Each overlay link (lo_i) is a direct network path between a pair of overlay nodes. An overlay link, the basic element in overlay networks, involves at least three components (the source and destination overlay nodes, and the underlying network between them), and can be abstracted as a 3-tuple model. For example, in Fig. 4 there are two overlay paths: po_ab between overlay nodes (a, b), and po_cd between overlay nodes (c, d). The overlay path po_ab consists of three overlay links: lo_ae between overlay nodes (a, e); lo_ef between (e, f); and lo_fb between (f, b). Each overlay link (e.g., lo_ae) can be represented as a 3-tuple model (a, Nae, e). Here, a and e are overlay nodes and Nae represents the logical underlay network component between a and e. Note that the granularity of the abstracted 3-tuple model of an overlay link is adjustable and depends on the information available from different ISPs. In this paper, we use the 3-tuple model and assume no collaboration from ISPs.
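A minimal sketch of the 3-tuple abstraction and of decomposing an overlay path into its links (illustrative only; the naming convention `N<src><dst>` for the underlay component is an assumption, not part of the model):

```python
from typing import NamedTuple

class OverlayLink(NamedTuple):
    """3-tuple model of an overlay link:
    (source node, logical underlay network, destination node)."""
    src: str
    underlay: str
    dst: str

def path_links(nodes):
    """Decompose an overlay path, given as a node sequence, into overlay links."""
    return [OverlayLink(a, f"N{a}{b}", b) for a, b in zip(nodes, nodes[1:])]
```

For the path po_ab of Fig. 4, `path_links(['a', 'e', 'f', 'b'])` yields the three links lo_ae, lo_ef, and lo_fb in 3-tuple form.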

In this overlay fault diagnosis framework (O-AIR), we assume that every overlay node can host multiple slices (e.g., Planet-Lab [2]) for running multiple overlay applications simultaneously. One of the slices on each overlay node is called the administration slice and is controlled by the Overlay Network Operation Center (OvNOC). Multiple overlay applications can run independently and simultaneously. Each overlay application may have a dedicated monitoring agent (denoted oAgent) to monitor various problems (e.g., unreachability or


Fig. 5. Overlay Monitoring Infrastructure

performance degradation) and send the corresponding end-to-end overlay path symptom (denoted oi), with the information on the corresponding overlay links, as a notification to the OvNOC (Fig. 5). We assume each oAgent has knowledge of its overlay application topology. For example, as shown in Fig. 4, if the oAgent observes an abnormal network behavior between overlay nodes a and b, it sends o_ab = (s_ae, s_ef, s_fb) to the OvNOC. Here, s_ij is an overlay link symptom related to lo_ij. If s_ij is negative, it means that at least one component in the corresponding 3-tuple model (i, N_ij, j) is faulty. Understanding the overlay profile is challenging; however, we believe that an overlay service provider can easily obtain statistical information about overlay node faults and observe end-to-end (i.e., overlay link) network symptoms. If we monitor a given overlay link (e.g., l_ae) for a long enough period, we can obtain the prior fault probabilities of each component of its 3-tuple model, p(fa), p(fae), p(fe), and the corresponding conditional probabilities, p(sae|fa), p(sae|fae), p(sae|fe). Such information can be aggregated into a central or distributed overlay knowledge system such that a query regarding the prior/conditional fault probabilities for a specific 3-tuple model can be answered. We call such an overlay knowledge system the Overlay Network Profile (ONP).
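A toy ONP might be organized as follows (a hypothetical API for illustration; the paper does not prescribe a concrete interface, and the symptom-naming convention `s_<src><dst>` is an assumption):

```python
class OverlayNetworkProfile:
    """Toy ONP: stores long-term prior and conditional fault probabilities
    for the components of each overlay link's 3-tuple, and answers queries."""

    def __init__(self):
        self.prior = {}   # component -> prior fault probability p(f)
        self.cond = {}    # (symptom, component) -> conditional probability p(s|f)

    def record(self, component, p_fault):
        """Record the long-term prior fault probability of a component."""
        self.prior[component] = p_fault

    def query(self, link):
        """Return {component: (prior, conditional)} for a link's 3-tuple
        (src, underlay network, dst)."""
        src, net, dst = link
        s = f"s_{src}{dst}"
        return {c: (self.prior.get(c, 0.0), self.cond.get((s, c), 0.0))
                for c in (src, net, dst)}
```

A query for lo_ae would then return the probabilities p(fa), p(fae), p(fe) and p(sae|·) that feed the O-SFA construction described next.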

Fig. 6. Overlay Symptom-Fault-Action Model

B. Dynamic Overlay Symptom-Fault-Action Model

As shown in Fig. 5, the task of the OvNOC is to correlate the received end-to-end symptoms ({oi}) and identify the root causes ({fi}). However, because the observed symptoms may

Notation   Definition
L          a set of overlay links
lo_ij      an overlay link between overlay nodes i and j, lo_ij ∈ L
Po         a set of overlay paths
N_ij       the underlying network between overlay nodes i and j
O          a set of overlay path symptoms {oi}
o_ij       an overlay symptom between overlay nodes i and j
S          a set of overlay link symptoms
A          a set of overlay actions

Fig. 7. Overlay Active Integrated Fault Reasoning Notation

not necessarily be sufficient to identify the existing network problem, the OvNOC may have to conduct a set of monitoring actions to determine the problems in real time with minimum cost. Overlay actions (denoted A) are used to verify the existence of the corresponding faults.

The observed symptoms can be positive or negative. Positive symptoms can be used to quickly narrow down the search space of root faults by removing all components explained by positive symptoms. Thus, in the following, we assume all received symptoms are negative.

Based on the received overlay symptoms (oi) and the ONP, we can dynamically construct an Overlay Symptom-Fault-Action (O-SFA) model (e.g., Fig. 6) as follows:
• For every observed overlay path (pi) related symptom, associate a vertex oi (oi ∈ O);
• For every overlay link li (li ∈ pi) related symptom, associate a vertex si (si ∈ S);
• For each component in the 3-tuple model of each involved overlay link, associate a vertex fi (fi ∈ F);
• For every overlay action, associate an action vertex ai of weight 1, ai ∈ A;
• For every component in F, associate an edge to its corresponding overlay link symptom vertex with a weight equal to P(si|fj), which is obtained from the ONP;
• For every overlay link symptom vertex si, associate an edge to every relevant overlay path symptom vertex oj that contains si;
• For every overlay action in A, associate an edge to its corresponding component in F with a weight equal to the administrative action cost, which is specified by OvNOC administrators.
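The construction rules above can be sketched as follows (an illustrative implementation; `onp_query` stands in for an ONP lookup, and the edge sets are returned as plain dictionaries rather than a graph library structure):

```python
def build_osfa(path_symptoms, onp_query, action_cost):
    """Construct an O-SFA model (sketch) from observed overlay path symptoms.

    path_symptoms -- {overlay path symptom o: list of 3-tuple links it reports}
    onp_query     -- onp_query(link): {component: p(s|f)} for the link's 3-tuple
    action_cost   -- {(action, component): administrative action cost}
    Returns the edge sets (O -> S, S -> F weighted, A -> F weighted).
    """
    o_to_s, s_to_f, a_to_f = {}, {}, {}
    for o, links in path_symptoms.items():
        o_to_s[o] = []
        for link in links:
            s = ("s",) + link                    # overlay link symptom vertex
            o_to_s[o].append(s)                  # path symptom contains s
            for comp, p_sf in onp_query(link).items():
                s_to_f[(s, comp)] = p_sf         # weight = P(s|f) from ONP
    for (a, comp), c in action_cost.items():
        a_to_f[(a, comp)] = c                    # weight = administrative cost
    return o_to_s, s_to_f, a_to_f
```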

Note that in an overlay network, all the potential faults being considered are directly observable via some actions (i.e., the up/down state of every overlay node, and of the network path between any two of them, can be independently and directly measured). O-AIR uses passively obtained symptoms to determine which potential faults to test first, as a cost optimization. Therefore, the overlay fault diagnosis task can be further defined as the following two sub-tasks: (1) given a set of received overlay end-to-end symptoms O = {oi}, find a fault hypothesis h comprised of a set of faults (h = {fi}) that can best explain the overlay symptoms O; (2) if h cannot be verified after taking verification actions, find a set of actions with the least cost that can lead to finding the root faults.

The overlay fault diagnosis (O-AIR) process includes three functional modules: Symptom Mining (SM), Fault Reasoning (FR), and Action Selection (AS). The Symptom Mining


module uses the observed overlay symptoms (O) to dynamically create the O-SFA based on the ONP. The Fault Reasoning module takes the O-SFA as input and returns a fault hypothesis h as output. The fault hypothesis h contains a set of faulty components that explains all symptoms observed so far. The corresponding overlay actions are selected to verify the hypothesis. If all faults in h are verifiable, the overlay fault diagnosis process can terminate. Otherwise, the action results are used to update the previously constructed O-SFA by removing explained overlay symptoms as well as irrelevant components, and adding new symptoms and related components. This process is repeated until a verifiable hypothesis is found.

C. Overlay Fault Localization

In the following, because of the similarity with the corresponding modules in AIR discussed in Section IV, we only introduce a new overlay contribution function used in the Fault Reasoning (FR) module, and briefly discuss the Symptom Mining (SM) and Action Selection (AS) modules.

1) Overlay Symptom Mining: The Overlay Symptom-Fault-Action model (O-SFA) needs to be dynamically created and updated. Two basic functions are performed in the overlay Symptom Mining (SM) module: (1) collecting overlay symptoms (O) and querying the ONP; (2) constructing/updating the O-SFA. The input of SM can come from different overlay application monitoring agents (oAgents), or from the returned results of active investigation actions.

2) Overlay Fault Reasoning: Given the O-SFA, the next task is to find the most likely root causes from F that can best explain all observed overlay path symptoms O. In the Overlay Fault Reasoning module, we use an Overlay Contribution Function, Co(fi), as a criterion to find the faults that have the maximum indications from the observed symptoms. In the O-SFA, an overlay link symptom si is introduced because of the observation of O. si can be caused by at least one component fi (fi ∈ Fsi) in the 3-tuple model, with different probabilities p(si|fi) ∈ (0, 1]. We assume that the Symptom-Fault correlation model in the O-SFA is sufficient to neglect other undocumented faults (i.e., their prior fault probability is very low). Thus, we can also assume that symptom si will not occur if no component in the 3-tuple model Fsi is faulty.

However, the conditional probability p(si|fi) itself may not truly reflect the chance that fault fi has occurred given the observation of symptom si. For example, in Fig. 6, by observing sef, in order to measure the indication of each fault (fe, fef, ff) in the creation of sef, we normalize the conditional probability p(si|fi) to the normalized conditional probability τ(si|fi), as in Eq. 1, to reflect the relative indication of each fault fi for si.

Similarly, we use Eq. 2 to calculate µ(fi|si), the relative probability that fi has happened given the observation of si. For example, in Fig. 6, assuming the overlay components (i.e., overlay nodes E and F) have a much higher prior fault probability (e.g., 20%) than the underlay components (e.g., Nef with prior fault probability 10%), then µ(fe|sef) = 0.8 × 0.2/(0.8 × 0.2 + 0.5 × 0.1 + 0.7 × 0.2) ≈ 0.46, µ(fef|sef) ≈ 0.14, and µ(ff|sef) ≈ 0.40.
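The worked example can be reproduced numerically (the probability values are the assumed figures from the text; note that the normalization in τ cancels in the ratio, so the raw conditional probabilities can be used directly):

```python
# Relative fault probabilities mu(f|s_ef) for the Fig. 6 example.
# Assumed inputs: p(s_ef|f_e) = 0.8, p(s_ef|f_ef) = 0.5, p(s_ef|f_f) = 0.7,
# priors p(f_e) = p(f_f) = 0.2 (overlay nodes), p(f_ef) = 0.1 (underlay network).
cond = {'f_e': 0.8, 'f_ef': 0.5, 'f_f': 0.7}
prior = {'f_e': 0.2, 'f_ef': 0.1, 'f_f': 0.2}

total = sum(cond[f] * prior[f] for f in cond)        # 0.16 + 0.05 + 0.14 = 0.35
mu = {f: cond[f] * prior[f] / total for f in cond}
# mu is approximately {'f_e': 0.457, 'f_ef': 0.143, 'f_f': 0.400}
```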

Thus, we can measure the prior fault probability of each related overlay link (represented by si), given an overlay path symptom oi (si ∈ oi), as follows:

σ(si|oi) = Σ_{fi∈Fsi} τ(si|fi) p(fi) / ( Σ_{si∈Soi} Σ_{fi∈Fsi} τ(si|fi) p(fi) )    (6)

In O-AIR, we developed the following Overlay Contribution Function Co(fi) to evaluate all contribution factors µ(fi|si) (si ∈ Sfi) and decide which fi is the best candidate, with the maximum indication value Co(fi), for the currently unexplained overlay path symptoms. Here, Ofi is the set of overlay symptoms that could be caused by fi.

Co(fi) = Σ_{si∈Sfi, oi∈Ofi} µ(fi|si) σ(si|oi)    (7)

Therefore, overlay fault reasoning becomes a process of searching for the fault (fi) with the maximum Co(fi), so that it can be included in the corresponding fault hypothesis h. This process continues until all observed symptoms are explained. We omit the details of the overlay fault reasoning algorithm, which is similar to Algorithm 1. However, in the overlay fault localization process, it is impossible to evaluate the fidelity of the hypothesis because of the incomplete symptom observation. Instead, the corresponding verification actions need to be taken to verify the correctness of the hypothesis.
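Eq. 7 translates directly into code (an illustrative rendering; `mu` and `sigma` are assumed to be precomputed lookup tables keyed by fault-symptom and symptom-path pairs, with missing pairs meaning the symptom does not belong to that path):

```python
def overlay_contribution(f, S_f, O_f, mu, sigma):
    """Overlay Contribution Function (Eq. 7, sketch).

    S_f[f]        -- link symptoms that fault f can cause
    O_f[f]        -- path symptoms that fault f can cause
    mu[(f, s)]    -- relative fault probability mu(f|s)
    sigma[(s, o)] -- prior link-symptom probability sigma(s|o)
    """
    return sum(mu[(f, s)] * sigma[(s, o)]
               for s in S_f[f] for o in O_f[f]
               if (s, o) in sigma)          # only pairs where o contains s
```

Fault reasoning then repeatedly picks the fault with the largest `overlay_contribution` value, mirroring the search loop of Algorithm 1.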

3) Overlay Action Cost Estimation: There are different ways to choose the action cost. Here we provide an intuitive idea of how to estimate the cost in an overlay environment. The action cost is a function of the "benefit" minus the "overhead". The benefit of an action is determined by the criticality and impact of the associated fault. The criticality reflects the severity of the fault with respect to business value and SLAs, and can be estimated from previous history or business policies. The fault impact estimates the magnitude of the damage caused by the fault (e.g., the number of users/customers affected by it). The overhead of an action is estimated in terms of bandwidth, delay (the number of overlay components and underlay links), and labor cost. These factors can be converted to real cost (e.g., dollars) based on business practices and policies, and then aggregated to calculate the total cost of an action. In our approach, the total cost is assumed to be a weighted sum of a number of normalized cost factors, as explained. We also assume that the cost factors and weights, like the action-symptom associations, can be provided by the network administrators.
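Under the weighted-sum assumption, the aggregation of normalized cost factors reduces to a one-line computation (the factor names and weights below are illustrative, not prescribed by the framework):

```python
def action_cost(factors, weights):
    """Total action cost as a weighted sum of normalized cost factors (sketch).

    factors -- {factor name: value normalized to [0, 1]},
               e.g. bandwidth, delay, labor
    weights -- {factor name: administrator-assigned weight}
    """
    return sum(weights[name] * value for name, value in factors.items())
```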

4) Overlay Action Selection: Depending on the overlay faults to verify (e.g., end-to-end packet loss), the corresponding overlay actions can be selected using simple utilities such as ping and traceroute, user scripts around individual tools such as sting, sprobe, and IPerf, or customized existing measurement frameworks such as ScriptRoute. Once an overlay fault hypothesis is generated, the corresponding overlay actions can be remotely invoked from a management workstation to verify the corresponding monitored overlay faults. As multiple tools (actions) can be used to verify the same type of overlay fault but with different costs (e.g., running time, intrusiveness, etc.), we developed an Action Selection module for O-AIR, similar to the one described in Section IV-C, to search for a set of overlay actions with minimal administrative action cost.


Fig. 8. The Impact of Symptom Loss Ratio in AIR (a) Detection time T (b) Detection rate α (c) False positive rate β

Fig. 9. The Impact of Spurious Symptoms in AIR (a) Detection time T (b) Detection rate α (c) False positive rate β

VI. SYSTEM EVALUATION

In this section, we evaluate the Active Integrated fault Reasoning (AIR) and Overlay Active Integrated fault Reasoning (O-AIR) frameworks via extensive simulation studies and Internet experiments on Planet-Lab [2]. The evaluation study considers the fault detection time (T) as the performance parameter, and the detection rate (α) and false positive rate (β) as the accuracy parameters.

A. Evaluation on AIR via Simulations

In our simulation, the number of network objects varies between [60, 600]. Each network object generates different faults, and each fault is associated with [2, 5] symptoms. Thus, the total number of symptoms varies from 120 to 3,000. We use fault cardinality (FC), symptom cardinality (SC), and action cardinality (AC) to describe the Symptom-Fault-Action matrix, such that FC defines the maximal number of symptoms that can be associated with one fault; SC defines the maximal number of faults to which one symptom might correlate; and AC defines the maximal number of symptoms that one action can verify. We set p(fi) and p(si|fj) in the ranges [0.001, 0.01] and (0, 1], respectively. Our simulation model also considers the following parameters: Initial Passive Period (IPP); Symptom Active Collecting Rate (SACR); Symptom Passive Collecting Rate (SPCR); Symptom Loss Ratio (SLR); Spurious Symptom Ratio (SSR); and Fidelity Threshold (FDTH).

The major contribution of this work is to offer an efficient fault reasoning technique that provides accurate results even in the worst cases (e.g., when the SLR and SSR are high). We show how these factors affect the performance (T) and accuracy (α and β) of our approach and of a passive fault reasoning approach.

1) The Impact of the Symptom Loss Ratio: Symptom loss hides fault indications, which negatively affects both the accuracy and the performance of the fault localization process. In this simulation, we set SSR = 0, IPP = 10 sec, SACR = SPCR = 100 symptoms/sec, and vary the SLR from 10% to 30%. As the SLR increases, the passive fault reasoning system may become infeasible. In this case, we have to reduce the fidelity threshold so that the passive reasoning process can converge in a reasonable time. In Fig. 8(a), we can see that, in contrast to the passive approach, the AIR system always reaches a relatively high fidelity threshold, with an average performance improvement of 20% to 40%. In addition to the performance improvement, the AIR system also shows high accuracy. With the same settings, Fig. 8(b) and (c) show that the active approach gains a 20-50% improvement in the detection rate and a 20-60% improvement in the false positive rate, even with much stricter fidelity criteria than the passive reasoning approach.

2) The Impact of Spurious Symptoms: Spurious symptoms can also be regarded as observation noise, which can seriously affect fault reasoning because they provide misleading information rather than withholding information. In this simulation, we set SLR = 0, IPP = 10 s, and SACR = 100 symptoms/sec. The relative signal-to-noise ratio can be calculated as SNR = (1 − SSR)/SSR when SLR = 0. Fig. 9(a) shows that on average AIR performs 10-20% better than the passive approach, even with a high fidelity value. With the same experiment settings, in Fig. 9(b) and (c), AIR shows an accuracy improvement over the passive approach of 10-50% in the detection rate and 10-40% in the false positive rate.


Fig. 10. The Impact of Network Size in AIR (a) Without symptom loss and spurious symptoms (b) With symptom loss (c) With spurious symptoms

Fig. 11. Intrusiveness Evaluation

3) The Impact of Network Size: In this section, we examine the scalability of AIR when the network size and the number of symptoms increase significantly. To show this, we measure the AIR detection time under three scenarios: (1) without symptom loss or spurious symptoms (Fig. 10(a)); (2) with symptom loss only (Fig. 10(b)); and (3) with spurious symptoms only (Fig. 10(c)). In all three cases, when the network size increases 10 times (from 100% to 1000%), the detection time increases by only 1.7 times (170%), 3.7 times (370%), and 5.8 times (580%) in Fig. 10(a), (b), and (c), respectively. This shows that even in the worst-case scenario (Fig. 10(c)), growth in network size causes only a slow, sub-linear increase in AIR detection time.

4) The Impact of Symptom Loss on AIR Intrusiveness: AIR intrusiveness is measured by the total number of actions performed to localize faults. As shown in Section IV, the intrusiveness of AIR is algorithmically minimized by (1) considering only fault hypotheses of high credibility, and (2) selecting the minimum-cost actions using the greedy algorithm described in Section IV-C. We conducted experiments to assess the intrusiveness (i.e., action cost) as the symptom loss ratio increases. In this simulation, we set the Action Cardinality (AC) to 3. Evidently, the higher the AC, the more symptoms can be verified by a single action, and thus the fewer total actions are required. For comparison, we also implemented an Active Fault Reasoning algorithm called AFR. AFR takes the observed symptoms (S_O) as input to find all related faults (i.e., F_{S_O}). Then, AFR selects actions to verify all possible symptoms that can be caused by F_{S_O}. Fig. 11 shows that, across different network sizes and with a prior fault probability as high as 10%, the number of actions required by AIR increases slowly and roughly linearly (from 1 to 22) even when the symptom loss ratio increases significantly (from 2% to 35%). However, compared to AIR, AFR may take up to 200% more actions to find a satisfactory fault reasoning result. For example, in a large-scale network of 600 objects, the number of actions performed by AIR did not exceed an action/fault ratio of 0.37, whereas AFR exceeded a ratio of 1.19.
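The greedy minimum-cost action selection summarized above can be sketched as a classic weighted set-cover heuristic. This is only an illustration of the idea: the data structures, action names, and costs below are ours, not taken from the paper.

```python
def select_actions(unverified, actions):
    """Greedily pick actions until every unverified symptom is covered.

    `unverified`: set of symptom ids still needing verification.
    `actions`: dict mapping action id -> (cost, set of symptoms it verifies).
    Each round picks the action with the lowest cost per newly covered
    symptom (the standard greedy set-cover heuristic).
    """
    remaining = set(unverified)
    chosen = []
    while remaining:
        best, best_ratio = None, float("inf")
        for aid, (cost, covers) in actions.items():
            gain = len(covers & remaining)
            if gain == 0:
                continue
            ratio = cost / gain
            if ratio < best_ratio:
                best, best_ratio = aid, ratio
        if best is None:          # no remaining action covers anything
            break
        chosen.append(best)
        remaining -= actions[best][1]
    return chosen

# Action cardinality 3: each action verifies up to 3 symptoms.
acts = {
    "ping_a":  (1.0, {"s1", "s2", "s3"}),
    "trace_b": (2.0, {"s3", "s4"}),
    "probe_c": (1.5, {"s4", "s5"}),
}
plan = select_actions({"s1", "s2", "s3", "s4", "s5"}, acts)
```

With these illustrative costs, the heuristic first picks "ping_a" (lowest cost per newly covered symptom) and then "probe_c", covering all five symptoms with two actions.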

B. Evaluation of O-AIR via Simulations

In the following, we describe our simulations of the O-AIR framework. O-AIR can be applied to locate various network faults, such as delay, loss, and jitter. In our simulations, we assume that overlay symptoms of the same type are collected. We consider the following dimensions and parameters for the simulations and experiments.

• Underlying Network Topology and Size: We use the synthetic topology generator BRITE [19] with three types of topology models. In addition, we import into BRITE real network AS topology data from Skitter [20], with each data set containing more than 20,000 ASes.

• Overlay Network Topology: For N given overlay nodes, we create ⌈10%·N⌉ overlay applications. Each overlay application constructs a tree with 10 overlay paths. The number of overlay links in each overlay path is uniformly distributed in [2, 6].

• User Observation Ratio: Each overlay application can decide how many of its critical overlay paths are monitored.

• Fault Ratio: The prior fault probabilities of the overlay nodes and underlay networks are uniformly distributed in [0.01, 0.2] and [0.001, 0.1], respectively. Here we assume the underlay nodes (e.g., network routers) are more reliable than the overlay nodes (e.g., user workstations).
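The simulation parameters above can be sketched as a minimal setup script. Function and variable names are ours, and the paper's actual generator may differ in detail.

```python
import math
import random

random.seed(7)

def build_overlay(n_nodes):
    """Create ceil(10% of N) overlay applications, each a tree of 10
    overlay paths whose lengths are uniform in [2, 6] overlay links."""
    n_apps = math.ceil(n_nodes / 10)  # ceil(10% * N), computed exactly
    apps = []
    for _ in range(n_apps):
        paths = [random.randint(2, 6) for _ in range(10)]  # links per path
        apps.append(paths)
    return apps

def fault_priors(n_overlay, n_underlay):
    """Prior fault probabilities: overlay nodes uniform in [0.01, 0.2],
    underlay nodes uniform in [0.001, 0.1] (underlay assumed more
    reliable than overlay, as in the simulation setup)."""
    return ([random.uniform(0.01, 0.2) for _ in range(n_overlay)],
            [random.uniform(0.001, 0.1) for _ in range(n_underlay)])

apps = build_overlay(100)
```

For 100 overlay nodes this yields 10 overlay applications, each with 10 paths of 2-6 links.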

1) The Impact of Network Topology and Size: We use BRITE [19] to create four types of topologies with network sizes ranging from 100 to 30,000 network objects: (1) AS-level Waxman model; (2) AS-level Barabasi-Albert model; (3) Hierarchical model; (4) Skitter [20]. As shown in Fig. 12, when the network size increased 300 times, the corresponding detection time increased only 11 times (approximately from 5 to 55 seconds). For the detection rate and the false positive rate, the variation is within 10% over the same increase in network size.

Fig. 12. The Impact of Overlay Network Topology and Size (a) Detection time T (b) Detection rate α (c) False positive rate β

2) The Impact of Overlay Symptom Loss: The identification and explanation of power laws has become an increasingly dominant theme in the recent body of network topology research [19]. In the following simulation, we use only the BRITE Hierarchical Topology with the AS Barabasi-Albert and Router Waxman models, which properly represent a power-law node distribution. We simulate three different scenarios: (1) a small-size network (100 nodes); (2) a medium-size network (1,000 nodes); (3) a large-scale network (10,000 nodes). For all generated network faults, we increase the Overlay Symptom Loss Ratio from 10% to 100% in steps of 10%. Obviously, as the Overlay Symptom Loss Ratio increases, the detection time increases significantly (it can be 50 times higher), as shown in Fig. 13. One cause is insufficient symptom observation, which forces us to take more actions to collect the relevant information. In addition, the incomplete observation can result in a hypothesis of low credibility, which requires further actions to strengthen the reasoning result. However, the detection rate remains relatively stable, decreasing only slightly as the symptom loss ratio increases. The false positive ratio is also effectively controlled, thanks to the integration of the active actions.

C. The Experiments on Planet-Lab

We experimented with the O-AIR system on the Planet-Lab testbed. In the following, we elaborate on the major steps.

1) Creating an OSP: The Planet-Lab nodes are mainly distributed across five geographic zones: North America, South America, Europe, Asia, and Oceania. North America nodes can be further classified as EDU and non-EDU nodes. Thus, the Planet-Lab nodes can be classified into six categories as shown in Table II. We create an OSP using the following parameters:

• the total number of overlay nodes and their distribution: Based on the industrial OSP infrastructure and the available resources from Planet-Lab shown in Table I, we select 100 well-distributed Planet-Lab nodes, each from a different organization/network, to construct an experimental OSP as shown in Table II.

• the total number of overlay links: Technically, for 100 overlay nodes, we can construct 100 × 99 = 9,900 directed overlay links. (Here, we regard the overlay link l_{ab} as distinct from l_{ba}.) For simplicity and without loss of generality, in our experiments we randomly select 3,000 overlay links.

• the total number of overlay paths: Each overlay path consists of a sequence of overlay links. From the 3,000 overlay links selected above, we construct 500 overlay paths, each consisting of 3-6 acyclic overlay links.
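The directed-link count and random selection above can be illustrated in a few lines; the node names here are synthetic.

```python
import itertools
import random

random.seed(1)
nodes = [f"node{i}" for i in range(100)]

# l_{ab} and l_{ba} are distinct, so enumerate ordered pairs
# (permutations), not unordered ones (combinations).
all_links = list(itertools.permutations(nodes, 2))

# Randomly select 3,000 of the 9,900 directed overlay links,
# without replacement.
selected = random.sample(all_links, 3000)
```

Using permutations rather than combinations is what yields 100 × 99 = 9,900 links instead of 4,950.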

TABLE I
OVERLAY NETWORK DISTRIBUTION

Overlay Service Providers   Planet-Lab      Akamai
Nodes                       ~800            ~25,000
Networks                    ~400            ~1,000
Countries                   ~30             ~70
Applications                CoDeeN, CoMon   Yahoo, Facebook

TABLE II
EXPERIMENTAL OVERLAY TOPOLOGY

Planet-Lab Node Distribution    OSP Infrastructure
N. America (edu) (346)          40
N. America (non-edu) (28)       10
S. America (18)                 8
Europe (230)                    20
Asia (138)                      20
Oceania (9)                     2

Fig. 13. The Impact of Overlay Symptom Loss Ratio (a) Detection time T (b) Detection rate α (c) False positive rate β

2) Building the ONP: In this experiment, we use Packet Loss as the monitored object. We built the ONP over periods of one, two and four weeks, respectively, and denote the corresponding ONPs as ONP1, ONP2 and ONP4. When constructing the ONP, each overlay node sends a 40-byte UDP packet every 30 s to the corresponding overlay nodes on the same selected overlay links. In order to monitor the status of the selected overlay nodes (e.g., when an overlay node is down), from the same network as each node we choose another 100 overlay nodes as mirrored overlay nodes to ping their peers and log the test results. In practice, there is a potential scalability issue in constructing the ONP if the number of overlay nodes and the corresponding overlay links increases significantly. Clustering overlay nodes based on their network properties (e.g., nodes on the same network or using the same ISP can be clustered together) could be an effective approach to overcoming this problem. We leave this as future work. For all 3,000 overlay links, during the periods of constructing ONP1, ONP2 and ONP4, the overall link loss ratio is 1%, 4% and 7%, respectively. Moreover, among all lossy links in each ONP, 37%, 33% and 29% of the lossy links were caused by overlay node failures. The ONP information is aggregated at a management workstation (MW) and updated periodically. In our experiment, constructing the ONP is a continuous process, and the corresponding network measurement is conducted at a fixed rate (i.e., once every 30 s). The monitoring results can later be used to verify the fault reasoning results. In practice, the network sampling period in constructing the ONP can be increased significantly over time to reduce network intrusiveness.
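As a rough sketch, the per-link loss statistics recorded in an ONP could be aggregated from the 30-second probe logs as follows. The log format here is an assumption for illustration, not taken from the paper.

```python
def link_loss_ratio(sent, received):
    """Loss ratio for one overlay link over one ONP build period."""
    if sent == 0:
        return 0.0
    return 1.0 - received / sent

def overall_loss(probe_log):
    """`probe_log`: dict mapping overlay link -> (packets sent, packets
    received). Returns the overall loss ratio across all links."""
    total_sent = sum(s for s, _ in probe_log.values())
    total_recv = sum(r for _, r in probe_log.values())
    return link_loss_ratio(total_sent, total_recv)

# One week at one 40-byte packet every 30 s is 20,160 probes per link.
log = {("a", "b"): (20160, 20160), ("b", "c"): (20160, 19354)}
loss = overall_loss(log)
```

A longer build period (e.g., ONP4) simply accumulates more probe samples per link, which is why its statistics are more reliable.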

3) Monitoring Overlay Paths: An end-to-end overlay path is represented as a sequence of overlay links. In practice, the overlay paths are formed, and can be tracked, by the corresponding overlay applications. In our experiment, a source overlay routing table, which contains the corresponding sequence of overlay links for each overlay path, is assigned to the source overlay node of each constructed overlay path. A source node sends UDP test packets to the next overlay node according to the source overlay routing table, and so on until the last node along the overlay path. Upon receiving the test packets, the last overlay node sends an acknowledgement packet carrying the sequence number of each test packet directly to the source overlay node, without following the reverse source overlay routing table. At the same time, a log is saved and sent to the management workstation (MW) so that we can later verify whether a symptom is a false alarm caused by the loss of acknowledgement packets. If a source node does not receive the expected acknowledgements before a timeout, a symptom is generated and sent to the management workstation (MW) with an NTP timestamp and the corresponding overlay path information (e.g., the sequence of overlay links).
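The timeout-based symptom generation rule at the source node can be sketched as follows; the field names and values are illustrative, not from the paper.

```python
def detect_symptoms(sent_seqs, acked_seqs, path_links, now):
    """Return one symptom per test-packet sequence number that was sent
    but never acknowledged before the timeout. Each symptom carries the
    path's overlay-link sequence and a timestamp (an NTP timestamp in
    the paper's setup)."""
    symptoms = []
    for seq in sorted(set(sent_seqs) - set(acked_seqs)):
        symptoms.append({
            "seq": seq,
            "path": list(path_links),  # sequence of overlay links
            "timestamp": now,
        })
    return symptoms

# Hypothetical three-link overlay path and a run where acks 2 and 4
# never arrived before the timeout.
path = [("src", "m1"), ("m1", "m2"), ("m2", "dst")]
syms = detect_symptoms([1, 2, 3, 4], [1, 3], path, 1700000000.0)
```

Attaching the full link sequence to each symptom is what later lets the management workstation correlate symptoms from different paths that share an overlay link.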

4) Overlay Fault Reasoning via O-AIR: For all symptoms received within the same observation window (we use 15 minutes, a reasonable estimate of the average fault period [13]), we dynamically construct an O-SFA model based on an existing overlay network profile (i.e., ONP1, ONP2 or ONP4). In this experiment, the overlay actions are designed, based on the 3-tuple overlay model of each overlay link, to check the status of the source and destination overlay nodes and the link quality (i.e., the packet loss ratio recorded when constructing the corresponding ONP).

Fig. 14. O-AIR Intrusiveness Evaluation on Planet-Lab

Fig. 15. O-AIR Accuracy Evaluation on Planet-Lab

5) Experiment Results: We consider an experiment result (also called an experiment snapshot) valid only if more than 100 overlay components (i.e., |F_{S_O}| > 100) are related to the observed overlay symptoms within the same observation window (i.e., 15 minutes). Since the fault rate is relatively low in a real network, in our experiments we collected five snapshots in total from the overlay experimental environment described above. First, we evaluate the intrusiveness (i.e., the total number of verification actions) of O-AIR with the different ONPs. Fig. 14 shows that O-AIR produced very promising results regarding its intrusiveness, with an average of four, eight and 11 verification actions when using ONP4, ONP2 and ONP1, respectively. The detection time of the five experiments based on the different ONPs is distributed between five and 25 seconds, which is proportional to the number of required actions. Note that when using an ONP built over a shorter period (e.g., ONP1), O-AIR may take many more actions because it lacks sufficient statistical information about the overlay network. For example, if an overlay link is reported with observed symptoms for the first time (i.e., no related record can be found in the ONP), exhaustively taking all verification actions on every related 3-tuple component is inevitable. This explains why using ONP4 can reduce the required verification actions.

It is also interesting to note how the ONP impacts the accuracy of O-AIR. Accordingly, for each of the experiments discussed above, we let O-AIR adopt ONP1, ONP2 and ONP4, respectively, to analyze the root causes. As shown in Fig. 15, there are no obvious differences in fault reasoning accuracy between ONP1 and ONP2. However, the reasoning accuracy based on ONP4 outperforms both ONP1 and ONP2. This result suggests that statistics based on long-term network monitoring provide a more reliable reference and thus a more accurate reasoning result.

VII. CONCLUSION AND FUTURE WORK

In this paper, we present a novel technique called Active Integrated fault Reasoning, or AIR. AIR is designed to minimize the intrusiveness of investigation actions (e.g., probing) by incrementally enhancing the fault hypothesis and optimizing the investigation action selection process. With the advent of the overlay service model, effectively managing overlay service networks has become a significant challenge. In this paper, we also propose a novel end-to-end, user-level overlay fault diagnosis framework (called O-AIR). O-AIR dynamically creates the correlation matrix as an O-SFA and seamlessly integrates passive and active fault reasoning. The results of our simulations and Internet experiments show that both AIR and O-AIR are efficient and accurate.

In our future work, we will investigate the automatic creation of the Symptom-Fault-Action correlation matrix from the network topology and high-level service specifications, especially for dynamic overlay networks.

REFERENCES

[1] Akamai Technology Overview, http://www.akamai.com/html/technology.
[2] PlanetLab: http://www.planet-lab.org.
[3] J. L. Hellerstein, K. Katircioglu, and M. Surendra. A Framework for Applying Inventory Control to Capacity Management for Utility Computing, IEEE/IFIP Integrated Management (IM'2005), (May 2005).
[4] G. J. Lee and L. Poole. Diagnosis of TCP Overlay Connection Failures using Bayesian Networks, SIGCOMM'06 Workshops, (Pisa, Italy, September 2006).
[5] S. Kliger, S. Yemini, Y. Yemini, D. Ohsie, and S. Stolfo. A coding approach to event correlation, Proceedings of the Fourth International Symposium on Intelligent Network Management, (1995).
[6] E. Al-Shaer and Y. Tang. QoS Path Monitoring for Multicast Networks, Journal of Network and Systems Management (JNSM), (2002).
[7] Y. Tang, E. S. Al-Shaer, and R. Boutaba. Active Integrated Fault Localization in Communication Networks, IEEE/IFIP Integrated Management (IM'2005), (May 2005).
[8] M. Steinder and A. S. Sethi. Increasing robustness of fault localization through analysis of lost, spurious, and positive symptoms, in Proc. of IEEE INFOCOM, (New York, NY, 2002).
[9] M. Steinder and A. S. Sethi. Probabilistic Fault Diagnosis in Communication Systems Through Incremental Hypothesis Updating, Computer Networks, Vol. 45, No. 4, pp. 537-562, (July 2004).
[10] I. Rish, M. Brodie, N. Odintsova, S. Ma, and G. Grabarnik. Real-time Problem Determination in Distributed Systems using Active Probing, IEEE/IFIP (NOMS), (Seoul, Korea, 2004).
[11] M. Brodie, I. Rish, and S. Ma. Optimizing Probe Selection for Fault Localization, IEEE/IFIP (DSOM), (2001).
[12] K. Appleby et al. Yemanja - a layered event correlation system for multi-domain computing utilities, Journal of Network and Systems Management, (2002).
[13] V. Paxson. End-to-End Routing Behavior in the Internet, IEEE/ACM Transactions on Networking, (1996).
[14] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms, Second Edition, The MIT Press.
[15] J. Guo, G. Kar, and P. Kermani. Approaches to Building Self Healing Systems using Dependency Analysis, IEEE/IFIP (NOMS), (Seoul, Korea, 2004).
[16] J. Pearl. Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann Publishers, (San Francisco, CA, 1988).
[17] G. Liu, A. K. Mok, and E. J. Yang. Composite events for network event correlation, Integrated Network Management VI, pages 247-260, (Boston, MA, May 1999).
[18] G. Jakobson and M. D. Weissman. Alarm correlation, IEEE Network, pages 52-59, (Nov. 1993).
[19] A. Medina, I. Matta, and J. Byers. On the origin of power laws in Internet topologies, ACM Computer Communication Review, (Apr. 2000).
[20] Skitter, CAIDA's topology measurement tool, http://www.caida.org/tools/measurement/skitter/.
[21] Y. Zhao, Y. Chen, and D. Bindel. Towards Unbiased End-to-End Network Diagnosis, in Proc. of ACM SIGCOMM, (2006).
[22] M. Coates, A. Hero, R. Nowak, and B. Yu. Internet Tomography, IEEE Signal Processing Magazine, Vol. 19, No. 3, (2002).
[23] M. Zhang, C. Zhang, V. Pai, L. Peterson, and R. Wang. PlanetSeer: Internet Path Failure Monitoring and Characterization in Wide-Area Services, Proc. Sixth Symposium on Operating Systems Design and Implementation, (December 2004).

Yongning Tang has been an assistant professor in the School of Information Technology at Illinois State University since 2007. His primary research areas are network monitoring, fault management, policy-driven network management, and network security. Prof. Tang has many journal and conference publications in the areas of distributed monitoring, fault localization, and multicast and overlay management.

Ehab Al-Shaer has been an Associate Professor in the School of CTI at DePaul University since 1998. He is co-editor of a number of books in the area of multimedia management and end-to-end monitoring. Prof. Al-Shaer has served as Program Co-chair for a number of conferences, including IM 2007, POLICY 2008, MMNS 2001, and E2EMON 2003-2005. He was a Guest Editor for a number of journals in his area. He has also served as a TPC member, session chair, and tutorial presenter for many IEEE, ACM, USENIX and IFIP conferences in his area, including INFOCOM, ICNP and IM/NOMS. His research is funded by NSF, Cisco, Intel, and Sun Microsystems. He received his M.S. and Ph.D. in Computer Science from Northeastern University, Boston, MA and Old Dominion University, Norfolk, VA in 1994 and 1998, respectively.

Raouf Boutaba received the M.Sc. and Ph.D. degrees in Computer Science from the University Pierre & Marie Curie, Paris, in 1990 and 1994, respectively. He is currently a Professor of Computer Science at the University of Waterloo. His research interests include network, resource and service management in wired and wireless networks. He is currently a Distinguished Lecturer of the IEEE Communications Society, and the chairman of the IEEE Technical Committee on Information Infrastructure and the IEEE Technical Committee on Autonomic Communications. He has received several best paper awards and other recognitions, such as the Premier's Research Excellence Award.