Alarm correlation - IEEE Network
Alarm Correlation
Correlating multiple network alarms improves telecommunications network surveillance and fault management.
Gabriel Jakobson and Mark D. Weissman
GABRIEL JAKOBSON is a principal member of technical staff at GTE Laboratories.
NetAlert™ is a trademark of GTE Telecommunication Services.
ALLINK is a trademark of NYNEX Corporation.
ART-IM™ is a trademark of Inference Corporation.
NMS/Core™ is a trademark of Teknekron Communications Systems.
Modern telecommunication networks may produce thousands of alarms per day, making the task of real-time network surveillance and fault management difficult. Due to the large volume of alarms, network operators frequently overlook or misinterpret them. To reduce the number of alarms displayed on operators' terminals, current network management systems apply alarm filtering procedures or, in the case of bursts of alarms, send them directly to a printer or database.
In this article, we will consider a relatively new process of real-time network management: alarm correlation. Alarm correlation is a conceptual interpretation of multiple alarms such that a new meaning is assigned to these alarms. It is a generic process that underlies different network management tasks such as context-dependent alarm filtering, alarm generalization, network fault diagnosis, generation of corrective actions, proactive maintenance, and network behavior trend analysis.
The goal of this article is twofold: first, to introduce an alarm correlation model and second, to describe the intelligent management platform for alarm correlation tasks (IMPACT), which implements the proposed model. Our approach to alarm correlation is based on the principles of model-based reasoning (MBR) [1]. As in MBR, we will define two basic components of the overall alarm correlation model: the structural component, which describes the network elements (NEs) and their connectivity and containment relations; and the behavioral component, which describes alarms and correlation.
The prototype of the IMPACT system has been developed at GTE Laboratories. It provides an intelligent environment for developing alarm correlation applications, and for real-time alarm monitoring. IMPACT has been used at GTE business units to build two network alarm correlation applications: AMES, for a land-based telecommunication network; and CORAL, for a cellular network.
Alarm correlation, as a subject of research and system development, has been discussed in several works. The aspects of time and space correlation of network events in the network troubleshooting domain were discussed in , where a knowledge-based approach was developed that described NEs and network events as knowledge-base entities. The conceptual approach to alarm correlation was discussed in [3]. A structural-phrase grammar-based approach to describe network connectivity and alarm correlation conditions was introduced in . An alarm correlation model was proposed in [5], where alarms caused by a single common fault were considered. Interpretation and correlation of events has been analyzed in other areas, such as electric power systems, nuclear-power-plant alarm management, and patient-care monitoring.
In the network management area, several vendors have incorporated expert systems into their platforms to support alarm correlation capabilities. NMS/Core™ from Teknekron Communications Systems includes programs to perform alarm filtering and correlation functions. The Sinergia system from CSELT, Italy, first uses expert system rules to recognize alarm correlation patterns and instantiate network fault hypotheses, and then applies heuristic search to determine the best solution among the hypotheses. ALLINK™ Operations Coordinator from NYNEX [10] uses an expert system to filter network alarms.
The rest of the article is organized as follows. The following section describes the basic notions associated with alarm correlation, and the section after that discusses the conceptual framework of alarm correlation. Next, we describe the structural component of the alarm correlation model, and then the behavioral component. Finally, an overview of the IMPACT system is given, and conclusions and future work are discussed.
Basic Notions of the Alarm Correlation Domain
In this section, we will give a short informal review of basic notions that we will use to explain the alarm correlation domain and its applications.
Faults and Alarms
A fault is a disorder occurring in the hardware or software of the managed network. Faults happen within the managed network or its components, while alarms are external manifestations of faults. Alarms defined by vendors and generated by network equipment are observable by network operators. We are considering only alarms mediated by alarm messages. Similar alarm messages with different time stamps are separate alarms. Faults can be causally related, thus forming an acyclic fault propagation graph, or independent (causally unrelated). External observation of alarms may instill an impression that one alarm causes another. However, the causality is not between alarms, but rather between faults.
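The distinction between fault causality and alarm observation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fault names, the alarm identifiers, and the helpers `is_acyclic`, `root_faults`, and `causally_related` are hypothetical and do not come from the article or from IMPACT.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical fault propagation graph: each fault maps to the faults that caused it.
caused_by = {
    "ds3_link_cut": set(),
    "trunk_down": {"ds3_link_cut"},
    "call_setup_failure": {"trunk_down"},
    "fan_failure": set(),  # an independent (causally unrelated) fault
}

# Hypothetical mapping from each observed alarm to the fault it manifests.
alarm_fault = {"A1": "call_setup_failure", "A2": "trunk_down", "A3": "fan_failure"}


def is_acyclic(graph):
    """Return True if the fault propagation graph has no causal cycles."""
    try:
        TopologicalSorter(graph).prepare()
        return True
    except CycleError:
        return False


def root_faults(fault, graph):
    """Follow causal edges back to the faults that have no cause of their own."""
    causes = graph.get(fault, set())
    if not causes:
        return {fault}
    roots = set()
    for cause in causes:
        roots |= root_faults(cause, graph)
    return roots


def causally_related(alarm_a, alarm_b):
    """Alarms are related only through their faults: here, by sharing a root fault."""
    roots_a = root_faults(alarm_fault[alarm_a], caused_by)
    roots_b = root_faults(alarm_fault[alarm_b], caused_by)
    return bool(roots_a & roots_b)
```

In this sketch alarms A1 and A2 appear related because both trace back to the same root fault, while A3 stands apart; no edge is ever drawn between alarms themselves.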
Alarm Correlation
Alarm correlation is a conceptual interpretation of multiple alarms such that new meanings are assigned to these alarms. It is a generic process that underlies different network management tasks:
Compression: the reduction of multiple occurrences of an alarm into a single alarm.
Count: the substitution of a specified number of occurrences of alarms with a new alarm.
Suppression: inhibiting a low-priority alarm in the presence of a higher-priority alarm.
Boolean: substitution of a set of alarms satisfying a Boolean pattern with a new alarm.
Generalization: reference to an alarm by its superclass.
Alarm correlation may be used for network fault isolation and diagnosis, selecting corrective actions, proactive maintenance, and trend analysis.
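The five operations listed above can be sketched as small Python functions over a list of alarms. This is a simplified illustration, not the IMPACT implementation: the `Alarm` fields, the convention that a larger `priority` means more important, the `SUPERCLASS` table, and the choice to replace all matching occurrences once a count threshold is reached are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    name: str
    source: str
    priority: int = 0  # assumption: larger value means higher priority


def compress(alarms):
    """Compression: keep one alarm per (name, source), in first-seen order."""
    seen, out = set(), []
    for a in alarms:
        key = (a.name, a.source)
        if key not in seen:
            seen.add(key)
            out.append(a)
    return out


def count(alarms, name, threshold, new_name):
    """Count: once `name` occurs `threshold` times, substitute one new alarm."""
    hits = [a for a in alarms if a.name == name]
    if len(hits) < threshold:
        return list(alarms)
    rest = [a for a in alarms if a.name != name]
    top = max(a.priority for a in hits)
    return rest + [Alarm(new_name, hits[0].source, top)]


def suppress(alarms):
    """Suppression: inhibit alarms below the highest priority present."""
    if not alarms:
        return []
    top = max(a.priority for a in alarms)
    return [a for a in alarms if a.priority == top]


def boolean(alarms, predicate, involved, new_alarm):
    """Boolean: if predicate(set of names) holds, replace `involved` alarms."""
    names = {a.name for a in alarms}
    if not predicate(names):
        return list(alarms)
    return [a for a in alarms if a.name not in involved] + [new_alarm]


# Hypothetical alarm class hierarchy used for generalization.
SUPERCLASS = {"Red-CGA": "CGA", "Yellow-CGA": "CGA"}


def generalize(alarm):
    """Generalization: refer to an alarm by its superclass, if it has one."""
    return Alarm(SUPERCLASS.get(alarm.name, alarm.name), alarm.source, alarm.priority)
```

Each function returns a new list and leaves its input untouched, so the operations can be chained in any order when experimenting with a stream of alarms.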
To illustrate the use of alarm correlation, we will give an example based on actual events that happened on a private telecommunication network. Because of an administrative error at a primary network control center, a circuit disconnect order was incorrectly sent to a common carrier, but withdrawn soon after. An additional error by the common carrier led to the disconnect order being carried out despite the cancellation. This meant that a live circuit was disconnected, causing a catastrophic failure on a major DS3 link between city A and city B (Fig. 1). A normal facility disconnect, when performed by network operations personnel, invokes automatic loopback conditions on digital cross-connect systems (DCSs) at both ends of the circuit. Since this is normal DCS behavior, the loopback conditions are not reported. The packet and voice switches having logical trunks over the disconnected circuit sent large volumes of call-processing failure messages to the primary network control center. The operators puzzled for an hour before they realized what had happened. The task at hand was to correlate the call-processing alarms from the switches with the absence of alarms from the DCSs, and recognize that the trunk was actually disconnected. This was complicated by the incorrect record in the database showing that the circuit was live.
Figure 1. Facility disconnect.
Figure 2. (a) Correlation of causally dependent alarms; (b) and (c) correlation of causally independent alarms.
Subjects for correlation could be any events affecting the network. These may be environmental-state parameters, the network management context, or events invoked by the user or external systems. Correlations are defined over a time interval or window. When a situation is recognized and a correlation asserted, it remains active until it expires or is externally cleared. Correlations may be subsumed by higher-level correlations.
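The lifetime of a correlation described here (asserted over a time window, then active until it expires or is cleared) can be sketched as follows. The `Correlation` class, its `lifetime` field, and the burst-style trigger in `assert_if_burst` are assumptions chosen for illustration; the article does not specify how IMPACT represents these.

```python
from dataclasses import dataclass


@dataclass
class Correlation:
    name: str
    asserted_at: float  # time of assertion, in seconds
    lifetime: float     # assumption: a fixed expiry interval, in seconds
    cleared: bool = False

    def is_active(self, now):
        """Active until the lifetime runs out or the correlation is cleared."""
        return not self.cleared and now < self.asserted_at + self.lifetime


def in_window(timestamps, now, window):
    """Return the alarm timestamps that fall inside [now - window, now]."""
    return [t for t in timestamps if now - window <= t <= now]


def assert_if_burst(timestamps, now, window, threshold, name, lifetime):
    """Assert a correlation when enough alarms arrive within the window."""
    if len(in_window(timestamps, now, window)) >= threshold:
        return Correlation(name, asserted_at=now, lifetime=lifetime)
    return None
```

Setting `cleared = True` models an external clear; a higher-level correlation that subsumes this one could do the same when it is asserted.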
The alarm correlation model introduced in this article distinguishes between correlations and correlation rules [11]. A correlation is a statement about events happening on the network; for example, Bad-Card-Correlation states that some port contains a faulty port card. A correlation rule defines the conditions under which correlations are asserted. For example, if there is a red carrier group alarm (CGA) from one DCS, and a Yellow-CGA from another, and these DCSs are connected, then Bad-Card-Correlation will be asserted. The conditional part of the rule may contain a complex Boolean pattern recognizing alarms, NEs, and correlations, as well as structural, temporal, and other relations.
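The Bad-Card-Correlation rule described above can be written as a short Python function to show how the alarm pattern and the structural relation combine. This is a sketch only: the NE names, the `CONNECTED` set, the alarm name "Red-CGA", and the tuple representation of alarms and correlations are hypothetical, and IMPACT's actual rule language is not shown in this section.

```python
# Hypothetical structural component: unordered pairs of connected DCSs.
CONNECTED = {frozenset({"DCS-A", "DCS-B"})}


def connected(x, y):
    """Structural relation: True if the two NEs are directly connected."""
    return frozenset({x, y}) in CONNECTED


def bad_card_rule(alarms):
    """Assert Bad-Card-Correlation for each Red-CGA / Yellow-CGA pair
    reported by two different, connected DCSs.

    `alarms` is an iterable of (alarm_name, source_ne) pairs.
    """
    alarms = list(alarms)
    reds = [src for name, src in alarms if name == "Red-CGA"]
    yellows = [src for name, src in alarms if name == "Yellow-CGA"]
    return [
        ("Bad-Card-Correlation", r, y)
        for r in reds
        for y in yellows
        if r != y and connected(r, y)
    ]
```

The rule returns the correlation as data instead of acting on it, which mirrors the distinction in the text between a correlation (a statement) and a correlation rule (the conditions for asserting it).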
IEEE Network, November 1993
Fault Diagnosis
One of the major applications of alarm correlation is network fault diagnosis. Not all faults exhibit alarms. These faults can be recognized indirectly by correlating available alarms. Figure 2a illustrates this, showing that correlation c1 detects the fault f1, and correlation c2 detects the fault f2. Correlating c1 and c2 into the correlation c0 allows diagnosis of the fault f0. Correlation between alarms due to a common fault is a transitive, reflexive, and symmetric relation (i.e., an equivalence relation, as noted in ). If a single alarm is a manifestation of multiple faults, this relation may not hold. For example, if alarm a (Fig. 2b) is caused by fault f1 or fault f2, but not both (an exclusive OR condition), then correlations c1 and c2 are formed with a common component alarm, and consequently the correlation relation is not transitive.
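The loss of transitivity can be checked mechanically with a small sketch. It assumes a simplified model in which each alarm carries a set of candidate faults and two alarms are correlated when those sets overlap; the alarm names, fault names, and helper functions are hypothetical and are not taken from Fig. 2.

```python
from itertools import product


def correlated(x, y, causes):
    """Two alarms are correlated if they share at least one candidate fault."""
    return bool(causes[x] & causes[y])


def is_transitive(causes):
    """Check x~y and y~z implies x~z over every triple of alarms."""
    alarms = list(causes)
    return all(
        correlated(x, z, causes)
        for x, y, z in product(alarms, repeat=3)
        if correlated(x, y, causes) and correlated(y, z, causes)
    )


# Every alarm manifests exactly one fault: correlation is an equivalence relation.
single_fault = {"a": {"f1"}, "b": {"f1"}, "c": {"f2"}}

# Alarm "a" may stem from f1 or f2: b~a and a~c hold, but b~c does not.
shared_alarm = {"a": {"f1", "f2"}, "b": {"f1"}, "c": {"f2"}}
```

With `single_fault` the check passes, while with `shared_alarm` it fails because alarm `a` links two groups of alarms that have no fault in common.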