Network and Voice Management Green Book ENU


CA GREEN BOOKS

Network and Voice Management

An Integrated Solution for Network Fault and Performance Management

OVERVIEW OF CONVERGED NETWORK CHANGES AND MANAGEMENT NEEDS

BEST PRACTICES FOR DEPLOYING CA’S INTEGRATED SOLUTION FOR NETWORK AND VOICE MANAGEMENT



LEGAL NOTICE

This publication is based on current information and resource allocations as of its date of publication and

is subject to change or withdrawal by CA at any time without notice. The information in this publication

could include typographical errors or technical inaccuracies. CA may make modifications to any CA

product, software program, method or procedure described in this publication at any time without

notice.

Any reference in this publication to non-CA products and non-CA websites are provided for convenience

only and shall not serve as CA’s endorsement of such products or websites. Your use of such products,

websites, and any information regarding such products or any materials provided with such products or

at such websites shall be at your own risk.

Notwithstanding anything in this publication to the contrary, this publication shall not (i) constitute

product documentation or specifications under any existing or future written license agreement or

services agreement relating to any CA software product, or be subject to any warranty set forth in any

such written agreement; (ii) serve to affect the rights and/or obligations of CA or its licensees under

any existing or future written license agreement or services agreement relating to any CA software

product; or (iii) serve to amend any product documentation or specifications for any CA software

product. The development, release and timing of any features or functionality described in this

publication remain at CA’s sole discretion.

The information in this publication is based upon CA’s experiences with the referenced software

products in a variety of development and customer environments. Past performance of the software

products in such development and customer environments is not indicative of the future performance of

such software products in identical, similar or different environments. CA does not warrant that the

software products will operate as specifically set forth in this publication. CA will support only the

referenced products in accordance with (i) the documentation and specifications provided with the

referenced product, and (ii) CA’s then-current maintenance and support policy for the referenced

product.

Certain information in this publication may outline CA’s general product direction. All information in this

publication is for your informational purposes only and may not be incorporated into any contract. CA

assumes no responsibility for the accuracy or completeness of the information. To the extent permitted

by applicable law, CA provides this document “AS IS” without warranty of any kind, including, without

limitation, any implied warranties of merchantability, fitness for a particular purpose, or non-

infringement. In no event will CA be liable for any loss or damage, direct or indirect, from the use of

this document, including, without limitation, lost profits, lost investment, business interruption, goodwill

or lost data, even if CA is expressly advised of the possibility of such damages.

COPYRIGHT LICENSE AND NOTICE:

This publication contains sample application programming code and/or language which illustrate

programming techniques on various operating systems. Notwithstanding anything to the contrary

contained in this publication, such sample code does not constitute licensed products or software under

any CA license or services agreement. You may copy, modify and use this sample code for the

purposes of performing the installation methods and routines described in this document. These

samples have not been tested. CA does not make, and you may not rely on, any promise, express or

implied, of reliability, serviceability or function of the sample code.

Copyright © 2007 CA. All rights reserved. All trademarks, trade names, service marks and logos referenced herein belong to their respective companies.




ACKNOWLEDGEMENTS

CA thanks the following people for their contributions to this CA Green Book:

Principal Authors

Don LeClair

Sue Andersen

Jason Bryk

Roger Craig

Bill Donoghue

Justin Gagnon

Brian Gollaher

Andrew Haigh

Kathleen Hickey

Mark Hounslow

John Kane

Michael Marks

John Murdough

Barbara O’Toole

Pete Oliveira

Jason Warfield

Dianne Weiss

The principal authors and CA would like to thank the following contributors:

Ajei Gopal

Tricia Bancroft

Lynn Beck

Gregory Buonaiuto

Curtis Lehman

Peter Clairmont

Dan Lewis

Anders Magnusson

Alexandre Moscoso

Joe Pennachio

David Soares

Peter Skotny

Cheryl Stauffer

Tom Wilson





CA PRODUCT REFERENCES

CA Network and Voice Management

eHealth®
eHealth® for Voice
SPECTRUM®
eHealth® for Voice Policy Manager
eHealth® E2E Console
eHealth® Live Health
eHealth® Traffic Accountant
eHealth® Universal Workflow Integration Modules
eHealth® Universal Data Integration Modules
eHealth® Universal Wireless Integration Modules
SPECTRUM® Infinity
SPECTRUM® Integrity
SPECTRUM® Xsight
SPECTRUM® OneClick
SPECTRUM® Service Manager
SPECTRUM® Report Manager
SPECTRUM® Alarm Notification Manager
SPECTRUM® ATM Circuit Manager
SPECTRUM® Configuration Manager
SPECTRUM® Secure Domain Manager
SPECTRUM® Frame Relay Manager
SPECTRUM® Microsoft Operations Manager Connector
SPECTRUM® Multicast Manager
SPECTRUM® OSS Integrations
SPECTRUM® QoS Manager
SPECTRUM® Remedy ARS Gateway
SPECTRUM® SNMPv3 Support
SPECTRUM® VPN Manager
SPECTRUM® Watch Editor
SPECTRUM® Service Performance Manager
SPECTRUM® Assurance Server Xsight
SPECTRUM® Assurance Server Integrity
SPECTRUM® Assurance Server Infinity



Contents

Chapter 1: Introduction
About This Book
Executive Summary
Evolving Requirements for Network and Voice Management
CA’s Network and Voice Management Solution

Chapter 2: Challenges of Network and Voice Management
Evolution of the Network-to-Service Delivery Platform
Impact on Network Operations Teams
Impact on Network Management Software Requirements

Chapter 3: CA’s Network and Voice Management Solution
EITM: CA’s Vision
Enterprise Systems Management
The Value of CA’s Network and Voice Management Solution
A Key Part of CA’s EITM Vision
Network and Voice Management for Key Vertical Markets
Telecommunication Service Providers
Government
Enterprise
The Components of the Solution
eHealth
eHealth Components
The Benefits of eHealth
SPECTRUM
SPECTRUM Components
The Benefits of SPECTRUM
Integration between eHealth and SPECTRUM
eHealth for Voice
The Benefits of eHealth for Voice
CA Technology Services Network and Voice Management Service Offerings
Assessment – Understanding the Gaps
CA Maturity Models
Design – Building the Right Solution
Implementation – The Bottom Line of Solution Success
Optimization – Anticipating Change
Why Trust Your Service Availability to CA Technology Services?
How the Solution Delivers the Key Points of Value
Effective Service Level Management
Proactive Service Assurance
Rapid Problem Resolution
Predictive Capacity Planning




Chapter 4: Deployment Architecture for Network and Voice Management
Network Performance Components
E2E Console
Live Health
Integration Modules
Distributed eHealth
Remote Poller
Report Center
Traffic Accountant
Network Fault Management Components
Assurance Server
OneClick
Watch Editor
Alarm Notification Manager
Frame Relay Manager
ATM Circuit Manager
Multicast Manager
QOS Manager
VPN Manager
SNMPv3
Secure Domain Manager
Configuration Manager
Report Manager
Service Performance Manager
Service Manager
Voice Management Components
eHealth for Voice
eHealth for Voice Policy Manager
Deployment Architectures
Small-to-Medium Enterprise Deployment
Large Service Provider Deployment
Network Performance Hardware and Software Requirements/Sizing
Network Fault Management Hardware and Software Requirements/Sizing
eHealth for Voice Single PC or Database Server Hardware and Software Requirements

Chapter 5: Setting Up and Configuring the Integrated Solution
Installing the CA Network and Voice Management Solution Software
Installation Prerequisites
Installation Steps
How You Install SPECTRUM
How You Install SPECTRUM OneClick and Report Manager
How You Install eHealth
How You Install eHealth for Voice
Configuring the Integrated Solution
Best Practices
Identify Resources and Use SPECTRUM to Discover Them as Global Collections
Import Global Collections into eHealth
Organize Your Resources by Creating eHealth Groups
Schedule eHealth Discoveries of Global Collections
Network and Voice Monitoring
Set Up Live Health
Forward Live Health Traps to SPECTRUM
Customize and Schedule Health Reports to Forward Traps
Configure eHealth for Voice to Send Alerts to SPECTRUM
Configure SPECTRUM to Recognize the eHealth Server
Configure SPECTRUM to View eHealth Alarms
System Maintenance
System Backup Archives
Data Recovery Best Practices




Chapter 6: Gathering System Information from Agents
Deployment and Administration of System Agents
Best Practices
Supported Agents
Prerequisites
How You Add System Agents in SPECTRUM
Unicenter NSM Agents
How You Add System Agents in eHealth
Performance Reporting On System Agents
At-a-Glance Reports
MyHealth Reports for Systems
Health Reports for Systems
Using Live Trend
How You Run Trend Reports for Systems
Top N Reports
What-If Capacity Trend Reports for Systems

Chapter 7: Service Level Management
Interview Procedures
Interview Questions
General Questions
Analysis and Mapping Procedures
How You Organize the Resource Information
How You Illustrate the Relationships of Resources to Each Other
How You Decompose the Information and Mapping to Service Models
Example of a Business Service Map to Service Models
Creating Service Models and Relationships
Key Concepts
How You Create Service Models
Example 1: A Customer Account Access Service
Example 2: Extend the Service to Monitor Critical Processes
Implement Example 2 in SPECTRUM
Example 3: Extend the Service to Include a Response Time Element
Create SLAs
Key Concepts
Create SLAs and Guarantees
Example 4: An SLA for the Customer Account Access Service
How You Implement the A to Z Account Access SLA in SPECTRUM
Service and SLA Reporting
Run SPECTRUM Service Manager Customer-Facing Reports
Service Availability by: Name, Customer, Owner
Service Availability Variable Health Level
Service Summary by: Name, Customer, Owner
Service Summary Variable Health Level
SLA Detail By Customer
SLA Inventory by Customer
SPECTRUM Service Manager: Internal Reports
Service Health by Service Name
Service Inventory
Top N Worst Performing Services
Top N Worst Performing Services Including All Outage Types
Top N Worst Service Outages
Top N Worst Service Resources by Total Downtime
SLA Status Current and Recent by Customer
SLA Summary by: Name, Customer, Status
SLA Summary Warned or Violated
SLA Detail By: SLA Name, Time Range, Last N Periods
SLA Detail with Resource Outages
Customer SLA Summary




Chapter 1: Introduction

About This Book

The CA Green Book for Network and Voice Management describes how to manage the performance and availability of converged networks. The CA solution provides proactive

management of voice and data services, ensures that bandwidth and system capacity is

sufficient, and supports business-driven service levels. The solution also provides integrated

network fault and performance management to support the network as a service delivery

platform.

The information contained in this CA Green Book is designed for network operators,

engineering, and technical staff charged with managing voice and data networks. The

deployment examples highlighted in this book present the views of a small enterprise and a

large service provider. This information may be useful for many other network

deployments, but may not meet all of their specific requirements.

This CA Green Book provides an understanding of capabilities that you can deploy today to manage your converged network. The opening sections provide a strategic view of the

trends toward converged networks, and the subsequent sections present best practices for

deploying and using the CA Network and Voice Management solution to manage the

converged network.

This CA Green Book contains only information about network and voice management. This

is one of a series of CA Green Books designed to help define the capabilities of CA’s key

solutions and provide best practices on how to manage and secure them. Other CA Green

Books will present solutions across a wide range of IT management topics including

systems management, database management, and workload automation.

This Network and Voice Management Green Book is targeted toward CIOs, network

management teams, and technical teams. The book is structured as follows:

Chapters 1-3: Provide CIOs and network managers with an overview of the challenges

of managing converged networks and the value of CA’s Network and Voice Management

solution.

Chapters 4-10: Deliver sample deployments for the products comprising CA’s Network

and Voice Management solution to network managers and other technical personnel.

Provide best practices for planning, deploying, and configuring this solution to speed the

time-to-value for investments made in optimizing the network.

This CA Green Book also covers the following topics:

Technical descriptions of the components that comprise the recommended solution

Best practices for setting up and configuring the components of the solution

Defining and managing network service levels

Best practices for enabling proactive service assurance and resolving problems quickly

Performing capacity planning and management of voice and data networks




Executive Summary

Evolving Requirements for Network and Voice Management

The entire computing infrastructure has been dependent upon the network. Recently, the

use of the network has changed enormously, with the convergence of data, voice, business-critical applications, and video content traveling over the same network. Network faults and

performance problems have immediate and negative business consequences for

productivity, cost, and revenue.

As a result, interest in and awareness of network fault and performance issues have

expanded to a wider audience of business-oriented and non-technical users who want and

need real-time information. In addition, the network operations team must rapidly learn

new technologies and expand their management responsibilities to support converged data

and voice networks.

In this environment, network management solutions must provide critical information and

management capabilities appropriate to both technical and business users. Network

management solutions need to provide configurable alerts, dashboards, and analytical capabilities to all users in addition to delivering traditional fault and performance

management.

Today’s network and voice management solutions must provide the following support:

Heterogeneous Networks: They must support data technologies such as Internet Protocol (IP), Asynchronous Transfer Mode (ATM), Frame Relay (FR), and broadband, as well as voice infrastructures comprising legacy time-division multiplexing (TDM) infrastructures, pure-play IP telephony (IPT) infrastructures, and hybrid infrastructures.

Scale: They must support vast networks distributed across countries and continents, and enable central or regional network management teams.

Integration: They must provide a single solution for managing data and voice infrastructures and critical systems.

Role-Based Service Information: They must support communication with external customers through Service Level Agreements (SLAs) and with internal customers through either Operational Level Agreements (OLAs) or less formal mechanisms. They must help both technical and business-minded audiences assess the network’s ability to support the business, and identify and pinpoint the root cause of problems.

CA’s Network and Voice Management Solution

CA’s Network and Voice Management solution provides converged voice and data

management. It supports the need for IT to proactively manage end-to-end voice and data

services, ensure adequate capacity of bandwidth and systems, and support service levels

defined by the business. This solution is a key part of Enterprise IT Management (EITM),

which is CA’s vision for how to dynamically manage and secure IT environments, enabling

organizations to fully realize the potential of IT.

CA’s Network and Voice Management solution spans both IP and legacy voice technologies,

enabling companies to migrate to IP telephony at their own pace, and reduce the

complexity of managing heterogeneous infrastructures.




Optimization – Anticipating Change

Optimization services evaluate ways in which your existing eHealth and SPECTRUM

solutions can be further utilized or fine-tuned. Health check services can include tuning

and reconfiguration, upgrades, and migrations, as well as training and certifications.

With CA’s converged network and voice management solution, the IT organization becomes

more proactive, rather than reactive, in its approach to managing voice and data. Network

operations teams have the tools to quickly determine the cause of problems. The IT

planning or engineering group can determine if resources are underutilized or reaching a

capacity threshold. The IT department can manage its relationships with key

constituencies with formal service levels. The ability to monitor and report on grade of

service (GoS) and quality of service (QoS) for calls in the voice network is essential to

successful service level management. CA’s Network and Voice Management solution makes

this all possible.




Chapter 2: Challenges of Network and Voice Management

All computing (mainframe, client-server, distributed, grid, or web services computing)

depends on the function of the network on which it resides. Today’s businesses recognize

the dependence of business-critical services, such as financial applications and voice, on the

network infrastructure. If the infrastructure is down or slow, the resulting impact on

business-critical applications and services, and the end users who rely on them, creates

loss of revenue and productivity, while increasing costs. On average, Infonetics estimates

that infrastructure downtime and degradation cost enterprises up to 3.6% of revenue

annually.[1]

Evolution of the Network-to-Service Delivery Platform

Over the past couple of years, the nature of the network itself has changed, and this

change has created significant implications for network management software and for the

operations and IT team members who use it. Not long ago, the main function of the

network was strictly maintaining data connectivity.

Today, the network is considered to be more of a service delivery platform. The network

supports real-time services such as Voice over IP (VoIP), IP Television (IPTV), and video

teleconferencing, all of which have evolved from early adoption and are now approaching

mainstream adoption. Enterprises are increasingly reliant on applications distributed over

wide geographic areas to provide any-time access to employees, customers, and partners

to accomplish critical business functions. Furthermore, the equipment comprising today’s

networks is now embedded with services such as security, high availability, and storage,

which were previously provided by infrastructures found outside of the network.

Impact on Network Operations Teams

The impact of this evolution on the roles and responsibilities of network operations teams

has been significant. As companies rely on real-time services to improve revenue, raise

productivity, and cut costs, any degradation in service has an immediate impact on

customer satisfaction and the business.

Because the network is carrying applications that directly impact the company’s bottom

line, the range of internal constituents who want to understand the performance of the

network has expanded from technical groups to business-oriented, non-technical, line-of-

business managers. Network operations team members now have a whole new set of

internal and external customers to whom they need to communicate their ability to meet

service level commitments. Because this group is non-technical, network operations teams need to communicate with them in business terms rather than technical terms.

1Infonetics Research, The Cost of Enterprise Downtime, North America 2004.

http://www.infonetics.com. Used with permission.




In addition, because of the tight inter-dependence between the network infrastructure and

the applications and services that are provided through it, network operations teams need

to extend their oversight from purely network infrastructure to applications, voice, and

other services as well. This extension requires a deeper understanding of new technologies

formerly managed by other teams. The embedding of services into network equipment has

also increased the range and complexity of devices that must be managed by the network

teams, which further complicates their roles.

Impact on Network Management Software Requirements

The evolution of converged networks and network operations challenges requires

management solutions able to maintain the connection between the business and both

internal and external customers. Strategic use of network management capabilities can

minimize potential accountability problems between network managers and application/IT

managers. One group needs to account for network performance issues, while the other is

responsible for the health of the applications deployed across the network and for satisfying

internal customers through the application user experience.

The ability to provide a multitude of reports and statistics on network status is essential.

However, simple and user-friendly network management tools, with high-level alert

dashboards and features that enable users to drill down to application and network issues,

are gaining acceptance and will ultimately be accounted for in the IT budget.

Converging networks in both enterprise and service providers will force network and IT

managers to view network conditions on a per-application, per-flow basis. New

opportunities are already in action in wireless local area network (WLAN) management,

security management, VoIP management, and network configuration. Automation and

visibility tools for managed service providers will be critical in offering services across

multiple networks to multiple offices. This need is further amplified by the tightly knit

supply chain within information networks and the increasing trend of distributed and mobile

workforces.

Operations teams charged with the responsibility to maintain converged networks face the

following critical challenges:

Heterogeneous Networks Networks are now composed of a broad range of

technologies and vendors, including data technologies like IP, ATM, FR, and broadband;

as well as voice infrastructures comprising legacy TDM infrastructures, pure-play

IPT infrastructures, and hybrid infrastructures. Most migration to VoIP occurs gradually;

therefore, the need to simultaneously manage both legacy TDM and IP telephony

environments still exists.

Scale Today’s voice and data infrastructures are vast and span wide geographic areas —

across time zones, countries, and even continents. The management system must be

able to scale to support these very large infrastructures and provide the required information to the operations teams, regardless of whether they are located centrally or

regionally.

Required to Manage Data Infrastructures, Voice Infrastructures, and Critical

Systems The management system must span domains to allow IT team members to

smoothly manage technical domains which were previously managed as silos.

Required to Communicate QoS to a Variety of Constituents Operations teams need

to be able to communicate to external customers through SLAs, and internal constituents




through OLAs or less formal mechanisms. Therefore, management software must contain

the intelligence and capabilities to do the following for both technical and business-

minded audiences:

› Assess the infrastructure’s ability to support the business.

› Identify problems as they occur.

› Pinpoint the source or cause of the problems.

In response to these trends, the worldwide network availability market is growing rapidly.

Delivering effective network management software will provide organizations with the tools

that they need to evolve into more efficient structures.








Chapter 3: CA’s Network and

Voice Management Solution

EITM: CA’s Vision

EITM is CA’s vision for how to dynamically manage and secure IT environments, enabling

organizations to realize the full potential of IT as a source of business value. EITM provides

a common foundation for the integration and sharing of services and data that allows for

the orchestration of all IT assets and resources in unison (infrastructure, applications, and

business processes). This business-oriented approach also makes it possible to integrate

the management of networks, systems, storage, databases, applications, and security as

well as to provide a way to measure, optimize, and demonstrate the impact of IT on the

organization’s goals as never before.

CA is the worldwide leader in management software solutions. We have been in the

management software business for three decades, and have been focused on providing solutions for all areas of infrastructure management. We are committed to helping

organizations achieve their goals by reducing IT costs to optimize capital and operating

expenses, mitigating risk, achieving compliance, and helping to ensure that the

infrastructure is always available and performing optimally.

The core of CA’s approach is to deliver management solutions that provide a unified view of

all assets and operations of the organization as they relate to business activities and needs.

This view enables organizations to align IT with business, enabling them to make better,

more informed business decisions about how to direct business activities and utilize assets.

CA’s management solutions include Service Availability, which helps IT departments deliver

consistently superior IT services by implementing proactive, integrated management that

provides insight into the health of all systems and applications on which each business

service depends.




The Value of CA’s Network and Voice Management Solution

For today's businesses to be able to compete, high-performance voice and data solutions

are essential. CA’s vision for converged voice and data management supports the need for

IT to proactively manage end-to-end voice and data services, ensure adequate capacity of

bandwidth and systems, and support service levels defined by the business. CA’s Network

and Voice Management solution spans both IP and legacy technologies, enabling companies

to migrate to IP telephony at their own pace, and reduce the complexity of managing

heterogeneous infrastructures. Our product strategy supports this vision through

continuous updating, innovation, and integration as defined by our customers.

CA’s Network and Voice Management solution provides the following key points of value:

Effective Service Level Management This solution enables you to baseline, assess,

and track services through the network and communicate adherence to SLAs and OLAs to

business and technical audiences.

Proactive Service Assurance It provides a policy-based approach to monitoring

degradations in service which gives you the ability to identify degradations before

customers are impacted, and to account for these degradations in RCA and event correlation.

Rapid Problem Detection This solution focuses on resolving the true cause of the

problem — not the symptom — through a combination of event correlation, RCA, and

linkage with real-time and historical reporting.

Predictive Capacity Planning The foundation of this solution is intelligent, embedded

algorithms that inform network operations teams exactly when to upgrade or downgrade

circuits or other hardware based on past usage trends and tailored thresholds.
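
The idea of projecting an upgrade date from past usage trends and a tailored threshold can be illustrated with a small sketch. This is not the algorithm embedded in the CA products, which this publication does not describe; it is a plain least-squares trend line, and the function names, the evenly spaced sampling, and the 80% threshold in the usage note are assumptions made for illustration only.

```python
from typing import Optional, Sequence, Tuple


def fit_line(values: Sequence[float]) -> Tuple[float, float]:
    """Least-squares slope and intercept for evenly spaced samples (x = 0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until_threshold(values: Sequence[float], threshold: float) -> Optional[float]:
    """Sampling periods after the last sample at which the trend reaches `threshold`.

    Returns 0.0 if the fitted trend is already at or above the threshold,
    and None if utilization is flat or falling (no upgrade is projected).
    """
    slope, intercept = fit_line(values)
    last_x = len(values) - 1
    current = intercept + slope * last_x
    if current >= threshold:
        return 0.0
    if slope <= 0:
        return None
    return (threshold - intercept) / slope - last_x
```

For example, weekly circuit utilization of 50, 52, 54, 56, and 58 percent against a hypothetical 80 percent threshold projects a crossing 11 weeks after the last sample. A production tool would likely also account for seasonality and busy-hour peaks, which a straight line ignores.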

Our network and voice management strategy is four-fold:

Enable our customers to manage their business processes, as well as the data, voice, and

multimedia services consistent with their competitive strategy.

Enable our customers to manage their converged networks, including business critical

applications and the transition from traditional to IP telephony.

Provide end-to-end fault and performance management for data, voice (TDM and IP

telephony), and system and application infrastructures.

Extend management to the voice and multimedia resources as well as the network

infrastructure.

CA is the only vendor who can provide the following:

An integrated, proactive management solution

A solution that spans both IP and legacy technologies

Solutions that help ensure voice network performance before, during, and after a

migration to VoIP

With CA’s Network and Voice Management solution, the IT organization becomes more

proactive — not reactive — in their approach to managing voice and data. For example,

instead of waiting to receive customer complaints about poor voice quality before acting, IT

staff will be alerted when policies detect that jitter, mean opinion scores (MOSs), or

hardware such as T1 circuits have crossed user-defined thresholds. When a

problem occurs in an IP telephony environment, it is sometimes difficult to determine the




cause. For example, a user may complain about not being able to make a call, but many

alarm events can be generated by routers, IPT systems, switches, etc. CA’s Network and

Voice Management solution provides a proactive approach to managing IP telephony and

VoIP.
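
To make the idea of a voice quality policy concrete, the following sketch checks one call measurement against user-defined thresholds. The R-factor to MOS conversion is the standard mapping from the ITU-T G.107 E-model; everything else (the `VoicePolicy` class, its default thresholds, and the message format) is hypothetical and is not taken from the CA products.

```python
from dataclasses import dataclass
from typing import List


def r_factor_to_mos(r: float) -> float:
    """Map an E-model R-factor to an estimated MOS (ITU-T G.107 conversion)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


@dataclass
class VoicePolicy:
    # Hypothetical defaults; in practice these are user-defined per site.
    min_mos: float = 3.6
    max_jitter_ms: float = 30.0


def violations(policy: VoicePolicy, r_factor: float, jitter_ms: float) -> List[str]:
    """Return one message per threshold that this measurement violates."""
    found = []
    mos = r_factor_to_mos(r_factor)
    if mos < policy.min_mos:
        found.append(f"MOS {mos:.2f} below {policy.min_mos:.2f}")
    if jitter_ms > policy.max_jitter_ms:
        found.append(f"jitter {jitter_ms:.0f} ms above {policy.max_jitter_ms:.0f} ms")
    return found
```

With these assumed thresholds, a call with an R-factor of 60 and 45 ms of jitter violates both limits, while a call with an R-factor of 93.2 and 10 ms of jitter violates neither.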

Capacity planning is essential to converged voice and data management. Data collected

from voice systems, whether legacy or IPT, and from the network (including trunks and port capacity) will help the IT engineering team determine if resources are underutilized or are

reaching a capacity threshold. This information can also be used as the basis for predictive

capacity planning for the network. IT departments are tasked with providing service levels

to key constituencies, especially revenue-generating business units such as contact centers.

The ability to monitor and report on GoS and QoS for calls in the voice network is essential

to service level management.
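
Grade of service for a trunk group is commonly expressed as the probability that a call is blocked because all trunks are busy. As an illustration of how trunk capacity relates to GoS, the sketch below uses the classical Erlang B formula. This is a textbook model and is an assumption here; the publication does not state how eHealth for Voice calculates GoS, and the function names are illustrative.

```python
def erlang_b(offered_erlangs: float, trunks: int) -> float:
    """Blocking probability for `offered_erlangs` of traffic on `trunks` circuits."""
    if offered_erlangs < 0 or trunks < 0:
        raise ValueError("traffic and trunk count must be non-negative")
    blocking = 1.0  # with zero trunks every call is blocked
    for n in range(1, trunks + 1):
        # Numerically stable recurrence: B(n) = A*B(n-1) / (n + A*B(n-1))
        blocking = offered_erlangs * blocking / (n + offered_erlangs * blocking)
    return blocking


def trunks_for_gos(offered_erlangs: float, target_blocking: float) -> int:
    """Smallest trunk count whose blocking probability meets the target."""
    if not 0 < target_blocking < 1:
        raise ValueError("target blocking must be between 0 and 1")
    trunks = 0
    while erlang_b(offered_erlangs, trunks) > target_blocking:
        trunks += 1
    return trunks
```

For 10 Erlangs of busy-hour traffic, 15 trunks give a blocking probability of roughly 3.65 percent, and meeting a 1 percent target requires 18 trunks. Comparing such figures with measured trunk usage is one way to see whether a trunk group is underutilized or approaching its capacity threshold.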

A Key Part of CA’s EITM Vision

CA’s Network and Voice Management solution fits within CA’s company-wide EITM strategy.

The solution addresses the four major CIO imperatives in the following ways:

Improves service by providing proactive service assurance to detect problems before they

impact end users. Ensures reliability and responsiveness of the network infrastructure

with powerful RCA, event correlation, and impact analysis.

Manages risk by assuring business continuity as it enables organizations to comply with

regulatory and governance requirements.

Manages costs by significantly reducing cost of downtime (outage or degradation) as it

minimizes the number of occurrences and the duration of downtime.

Aligns IT with business by giving the IT team a business view that provides status of end-

to-end business service.
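
The root cause analysis and impact analysis mentioned in the list above can be pictured with a simplified, topology-based sketch. It assumes that each device has a single upstream neighbor on the path toward the management station (a tree with no loops), and it treats an alarmed device as a symptom whenever a device upstream of it is also alarmed. The function names and the sample topology are hypothetical; SPECTRUM's actual modeling and correlation logic is richer and is not described in this publication.

```python
from typing import Dict, Optional, Set


def root_causes(upstream: Dict[str, Optional[str]], alarmed: Set[str]) -> Set[str]:
    """Alarmed devices that have no alarmed device anywhere upstream of them."""
    roots = set()
    for device in alarmed:
        parent = upstream.get(device)
        # Walk toward the management station until an alarmed ancestor is found.
        while parent is not None and parent not in alarmed:
            parent = upstream.get(parent)
        if parent is None:
            roots.add(device)
    return roots


def impacted(upstream: Dict[str, Optional[str]], root: str) -> Set[str]:
    """Devices whose path to the management station passes through `root`."""
    hit = set()
    for device in upstream:
        node = upstream.get(device)
        while node is not None:
            if node == root:
                hit.add(device)
                break
            node = upstream.get(node)
    return hit


# Hypothetical topology: each key maps to its upstream neighbor.
TOPOLOGY = {
    "core": None,
    "router1": "core",
    "switch1": "router1",
    "phone1": "switch1",
    "phone2": "switch1",
    "switch2": "core",
}
```

If `router1`, `switch1`, `phone1`, and `phone2` all raise alarms, `root_causes(TOPOLOGY, ...)` reduces the four alarms to the single device `router1`, and `impacted(TOPOLOGY, "router1")` lists the switch and both phones as affected.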

Network and Voice Management for Key Vertical Markets

CA provides powerful and unique management software for managing increasingly complex

network services within traditional enterprise and government environments, as well as

telecommunications, cable, mobile wireless, and other service provider industries. It

enables technical teams to improve service, control costs, reduce risk, increase revenue,

and drive efficiency when managing IT infrastructure as a business service.

Telecommunication Service Providers

CA views the telecommunication service-provider industries as an important vertical

market, the members of which leverage their operational environment as a key part of

controlling costs and delivering product/service differentiation — crucial factors to

remaining competitive in today’s challenging communications marketplace.

Most service providers select a variety of management tools specific to element

management, service provisioning, billing, customer care, and service assurance. Within

large carrier and service provider environments, this creates many disparate applications

and data stores. They must somehow be integrated to form efficient workflow processes

that ensure services are delivered reliably while maintaining operational efficiencies. CA’s

Network and Voice Management solution provides flexible integration points to allow it to

function successfully in heterogeneous Operational Support System (OSS) software

environments, while also reducing deployment time and complexity, as well as




The Components of the Solution

CA’s solution for this market is comprised of three primary product lines, all of which

provide a consistent set of capabilities across the voice and data infrastructure:

eHealth assesses the health of your network and determines if it can accommodate

voice. It tells you how well the network is performing, allows you to compare voice MOS

with QoS statistics, and identifies trends in performance.

SPECTRUM provides fault management, RCA, and voice modeling. It also enables you to

model the components of your voice services to ensure that services are operating

smoothly.

eHealth for Voice offers system performance management for communication systems

(IP and Traditional TDM), messaging systems, and management from a telephony

perspective.

These applications can work as standalone systems or integrated, as shown below.

eHealth

eHealth helps you take control of network performance and ensure QoS across the entire

network infrastructure. It enables you to successfully accomplish a multitude of tasks such

as ensuring the availability and performance of the network, documenting service levels,

managing capacity, and accurately planning for growth. This solution allows you to face a

number of challenges including managing a diverse collection of devices from numerous

vendors, isolating the source of performance degradation throughout the network,

minimizing recurring wide area network (WAN) expenses, and providing consistent

reporting across your heterogeneous network infrastructure.

This component of the solution enables you to achieve the following goals:

Improve the service availability of your network.

Reduce cost of downtime and end-user impact caused by downtime.




Identify and resolve problems faster.

Plan for capacity before it is needed.

Improve QoS.

Meet and prove committed service levels.

eHealth Components

eHealth includes the following:

eHealth E2E Console

eHealth Live Health

eHealth Traffic Accountant

Report Center

Distributed eHealth

eHealth SPECTRUM Integration

eHealth Universal Workflow Integration Modules (HP OpenView, IBM (Micromuse) Netcool, Cisco CIC)

eHealth Universal Data Integration Modules (Cisco WAN Manager, Cisco IP Solution Center, Lucent, Nortel, Alcatel)

eHealth Universal Wireless Integration Modules (Nortel, Starent)

The Benefits of eHealth

eHealth can be differentiated from other solutions in the following ways:

eHealth offers best-in-class proactive management so that IT can correct problems before

they become revenue-impacting issues.

It has the broadest multi-vendor support — over 1000 devices from 100 different

vendors.

Its reports have built-in intelligence to troubleshoot without requiring intimate knowledge

of every component of the service.

It provides auto-baseline; that is, eHealth will learn the normal behavior for each

managed device. The “deviation from normal” algorithm offers a more reliable

threshold relative to history because the window of comparison is continuous.

Unlike CA, many vendors have a limited portfolio of device and technology support, which

poses a problem to companies trying to reduce the number of management software

vendors.

› Customers want greater accountability from their vendors as IT becomes more of

a service.

› Fewer vendors result in less complexity, lower operating costs, and less risk in

rolling out new business initiatives.

The value of an integrated management platform is significant because IT operations

spend a large amount of time and money identifying the cause of problems, and

downtime is extremely expensive.
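
The auto-baseline idea in the list above, learning what is normal for each managed device and flagging departures from it, can be sketched with a rolling window of recent samples. This is a generic illustration and not eHealth's actual "deviation from normal" algorithm, whose details are not given here. The class name, window size, tolerance, and minimum spread are all assumptions.

```python
from collections import deque
from statistics import mean, pstdev


class RollingBaseline:
    """Learns normal behavior for one metric from a continuous window of samples."""

    def __init__(self, window: int = 24, tolerance: float = 3.0,
                 min_samples: int = 5, min_spread: float = 1.0):
        self.history = deque(maxlen=window)  # oldest samples fall off automatically
        self.tolerance = tolerance           # allowed deviation, in standard deviations
        self.min_samples = min_samples       # samples required before judging anything
        self.min_spread = min_spread         # floor so a very flat history is not oversensitive

    def observe(self, value: float) -> bool:
        """Return True if `value` deviates from the baseline, then add it to the history."""
        deviates = False
        if len(self.history) >= self.min_samples:
            center = mean(self.history)
            spread = max(pstdev(self.history), self.min_spread)
            deviates = abs(value - center) > self.tolerance * spread
        self.history.append(value)
        return deviates
```

Feeding utilization samples of 40, 42, 41, 39, 40, and 41 percent produces no alert, while a following sample of 90 percent is flagged. Note that this sketch also adds the outlier to the history; a real implementation might exclude or down-weight it so that one spike does not distort the baseline.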




The Benefits of eHealth for Voice

eHealth for Voice offers the following benefits:

Enables you to manage your business processes, including voice and multimedia services,

consistent with your competitive strategy.

Enables you to manage your transition from traditional to IP telephony.

Provides end-to-end fault and performance management for IP telephony networks.

Manages voice and multimedia resources to reduce risk, manage costs, enable new

services, and align your investments with your IT objectives.

CA Technology Services Network and Voice Management Service

Offerings

CA Technology Services has specialists in eHealth and SPECTRUM to help organizations

assess, design, implement, and optimize network and voice availability and performance

solutions across the enterprise. From financial services companies to telecommunications

companies to government organizations and beyond, CA experts help you establish best-practices workflows, integrate your network management solutions, and combine your

network and voice availability and performance solutions with your service desk for a

consolidated network event management system.

The focus of CA network and voice availability and performance experts is to help you

achieve the following:

Improve business alignment by mapping the network infrastructure to critical IT

services that support the business, and ensuring that your network team is focused on

delivering the organization’s most important services.

Increase business planning capabilities by delivering full visibility across the network

infrastructure through consolidated consoles, reports, and metrics analysis.

Reduce risk by defining and implementing automatic repair responses that avoid the

possibility of human error and guarantee problem repair consistency.

Reduce cost by consolidating network event management into a central point of control

which decreases staffing demands.

Leverage value from existing network management systems by integrating and

building upon your prevailing tools and workflows.

Optimize IT service delivery by applying International Organization for

Standardization (ISO) and CobiT standards, IT infrastructure library (ITIL) best practices,

and proven network management processes.

LIFECYCLE APPROACH FOR NETWORK AND VOICE AVAILABILITY AND PERFORMANCE

SOLUTIONS

The needs of every organization are unique, but network management yields common

themes in workflow processes and monitoring, and management instrumentation. A CA

solution is deployed and optimized to the particular needs of your organization through a

lifecycle of best practices services offerings.


http://www3.ca.com/technologies/CollateralList.aspx?CCT=19505&ID=5755




Assessment – Understanding the Gaps

Comprehensive assessments validate the maturity and efficiency of network and voice

availability and performance management. CA experts conduct a comprehensive analysis of

your network management capabilities including the following:

Network management goals, objectives, capabilities, and strategies

Network operations organization structure, and personnel roles and responsibilities

Network monitoring, configuration, and integration software

Network design and topology

Voice traffic simulation

Data analysis

Associated security constraints (firewalls, access lists, and so on)

Alarm/event severity definitions

Existing business, technical, and environmental challenges and issues

Change control processes

CA and third-party product integration requirements

Your current management capabilities are compared to the CA maturity model for people,

processes, and technology and the assessment results in a Solution Architecture Overview

(SAO). The SAO is a blueprint that defines achievable solution phases to maximize problem

determination and response workflows, apply automation, and integrate service desk

operations. CA consultants and architects also research and map the network infrastructure

to IT services, propose recommendations, and furnish business justifications to help you

secure funding.




Implementation – The Bottom Line of Solution Success

Using the SAS as a guide, CA consultants prepare the environment; install, configure, and

customize eHealth and SPECTRUM; verify and document your eHealth and SPECTRUM

solutions on test, QA, and production environments; and provide knowledge transfer to

your staff. Implementation services also include the development and deployment of

integration components between eHealth and SPECTRUM, your other IT management

applications, and your service desk.

To ensure that implementation efforts are tightly managed, PMP-certified Project Managers

track and report on progress, questions, issues, and roadblocks. CA Technology Services

uses PMP-certified Project Managers and highly trained architects, consultants, and

partners. On an annual basis, CA Technology Services invests 50% more in training our

professionals than the industry average.

Optimization – Anticipating Change

Optimization services evaluate ways in which your existing eHealth and SPECTRUM

solutions can be further utilized or fine-tuned. Healthcheck services can include tuning and

reconfiguration, upgrades, and migrations. Other services include training and certifications

that focus on increasing staff efficiency. Past experience has found that staff training results

in more efficient operations. These services are offered as onsite or offsite instructor-led,

self-paced, or web-based. Instructors or course developers are also certified experts and

dedicated to network and voice availability and performance.

Why Trust Your Service Availability to CA Technology Services?

Experience: CA has 30 years of enterprise systems management services experience.

Proven Process: A dedicated assessment team plans, designs, and provides business

justification for network and workflow recommendations and builds best practices into

every customer blueprint.

Expertise: A vibrant community of worldwide professionals focused on network, voice

availability, and performance shares their solutions knowledge and continually contributes proven best-practice workflows and solution models.

Focus: CA Technology Services is comprised of a team of Solution Managers and

dedicated architects who are devoted exclusively to the assessment, design, delivery, and

workflow methodologies offered around eHealth and SPECTRUM services and solutions.




Proactive Service Assurance

CA’s Network and Voice Management products are used together for proactive service

assurance through the embedded algorithms within all of the product lines. This helps

operations teams identify potential problems BEFORE they impact customer service.

Within eHealth, this is accomplished primarily through the Time over Threshold and

Deviation from Normal algorithms within Live Health, which allow an intelligent

performance-based alert to be sent when current performance violates either a fixed

threshold, or what is considered “normal” behavior (based on past history) for a particular

length of time within a given analysis window. Similarly, eHealth for Voice sends alerts

when violations of QoS or GoS are experienced. These alerts are fed into SPECTRUM, which

applies its intelligence on policy, models, and rules to identify the severity of the problem,

and provide alarm integration and correlation, taking advantage of the SPECTRUM Service

Management and voice modeling capability.
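
The Time over Threshold idea described above, alerting only when a metric stays above a fixed threshold for a particular length of time within an analysis window, can be sketched as follows. This is a simplified illustration and not Live Health's implementation: it assumes evenly spaced polls, counts each over-threshold sample as one full poll interval, and uses parameter names invented for this example.

```python
from typing import Sequence


def time_over_threshold(values: Sequence[float], threshold: float,
                        poll_interval_s: int, window_s: int,
                        min_duration_s: int) -> bool:
    """True if the metric spent at least `min_duration_s` above `threshold`
    within the most recent analysis window of `window_s` seconds."""
    samples_in_window = max(1, window_s // poll_interval_s)
    recent = values[-samples_in_window:]
    # Each sample above the threshold accounts for one poll interval of violation.
    seconds_over = sum(poll_interval_s for v in recent if v > threshold)
    return seconds_over >= min_duration_s


# Hypothetical 5-minute polls of link utilization (percent).
utilization = [70, 92, 95, 60, 93, 94, 91]
sustained = time_over_threshold(utilization, threshold=90, poll_interval_s=300,
                                window_s=3600, min_duration_s=900)
```

In this example five of the seven samples exceed 90 percent, which amounts to 1,500 seconds over threshold in the one-hour window, so `sustained` is true. A single brief spike would not reach the 900-second minimum, which is what keeps this style of alert from firing on momentary bursts.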




Chapter 4: Deployment

Architecture for Network and

Voice Management

This chapter provides information to prepare for the installation and configuration of CA’s

Network and Voice Management solution. The following key topics are presented:

Network performance components

Network fault management components

Voice management components

Deployment architectures

Sizing recommendations

Network Performance Components

eHealth is comprised of the following components:

Required Components: E2E Console

Optional Components: Live Health, Integration Modules, Distributed eHealth, Remote Poller, Report Center, Traffic Accountant

E2E Console

The E2E Console is the core of an eHealth implementation and is required to operate

eHealth. The E2E Console includes database, discovery, and poller functionality along with

administration GUIs, reporting GUIs, and so on. eHealth licenses (universal and system)

enable the eHealth Console to poll and collect data from certified devices with an embedded

management software agent, and are required to operate eHealth. An element represents the eHealth model, or representation, for any part of an infrastructure that eHealth can

analyze. eHealth can analyze a physical element, such as a specific port on a specific card

of a specific router. It can also analyze a logical element, which refers to the logical purpose

for a device or component, such as a network link. To determine if a device is certified for

use with eHealth, log on to the Certification pages at http://support.concord.com.

Note: You must have a Support account to access the http://support.concord.com site. You

obtain an account with the purchase of the eHealth or SPECTRUM products.




Live Health

Live Health is the real-time performance monitoring engine that analyzes performance data

collected with eHealth for deviations from normal behavior and threshold violations. Live

Health includes three components:

Live Exceptions gives you the ability to generate and display performance-based alarms.

Live Status provides a single end-to-end view of the status of your infrastructure.

Live Trend provides a real-time reporting capability.

Integration Modules

eHealth provides a set of integration modules (IMs) that enable you to use eHealth to

report on the data that various network management systems (NMSs) collect. When you

license and install an IM, you can use it to tap into data already collected by an existing NMS.

You can then import it “en masse” into eHealth, providing critical data quickly and

eliminating the need for redundant data gathering via duplicate polling.

Universal Workflow IMs enable customers to drill back from supported fault

management systems to eHealth. This type of IM is supported on the following systems: SPECTRUM, IBM (Micromuse) Netcool, Cisco Information Center (CIC), and HP OpenView

Network Node Manager.

Universal Data IMs enable the import of configuration and performance data. This type

of IM is supported on the following systems: Cisco WAN Manager, Cisco ISC, Lucent,

Alcatel, and Nortel.

Universal Wireless Data IMs enable import of configuration and performance data

from other wireless element management systems into the eHealth E2E Console. This

type of IM is supported on the following systems: Nortel Shasta SCS GGSN and Starent

ST-16 Bulk Stats.

Distributed eHealth

If you have a large infrastructure, you could deploy multiple Distributed eHealth Systems

across different physical locations or alternatively co-locate them in a central configuration

referred to as a cluster. The cluster contains several eHealth systems that manage specific

sets of resources, and share the information with each other. By using Distributed eHealth,

you can distribute the workload of collecting and processing data across multiple eHealth

systems that work in parallel. Report users can access reports for any element or groups in

the cluster from Distributed eHealth Consoles, which are reporting front-ends to the cluster.

You would typically choose a Distributed eHealth site when you want to run reports for

more elements than a standalone eHealth system can support. You might also choose a

Distributed eHealth site if you want to place an eHealth web server system outside the

firewall and insulate the Distributed eHealth Systems within the firewall of your

infrastructure. Depending on the number of Distributed eHealth Systems that you have, and the system performance of the Distributed eHealth Console, a Distributed eHealth site

could support reports for up to one million elements.

The Distributed eHealth Package software contains all software for Distributed eHealth

Consoles and the software required to turn a standalone eHealth System into a Distributed

eHealth System. You must purchase all console software, elements, and agents for the

standalone eHealth systems separately. For complete instructions on administering a

cluster, see the Distributed eHealth Administration Guide.




Network Fault Management Components

SPECTRUM is comprised of the following components:


Assurance Server

Assurance Server Xsight

Assurance Server Integrity

Assurance Server Infinity

OneClick

Watch Editor

SNMP V3

Alarm Notification Manager


Frame Relay Manager

Configuration Manager

VPN Manager

ATM Circuit Manager

Report Manager

Multicast Manager

Service Performance Manager

QoS Manager

Service Manager

Assurance Server

SPECTRUM offers three types of Assurance Servers designed for different types of

customers:

Assurance Server Xsight (for emerging enterprises)

Assurance Server Integrity (for larger enterprises)

Assurance Server Infinity (for Service Providers)

ASSURANCE SERVER XSIGHT

The Assurance Server Xsight delivers the capabilities of core SPECTRUM technologies to a broader array of small businesses. With the introduction of SPECTRUM Xsight, CA extended

the support of multi-vendor IP fault and performance management in a solution that is

competitively priced and packaged to help you become operational quickly. This component

provides support for most vendor devices found in today’s enterprise networks. It supports

single-server deployment only; it does not allow for a distributed deployment.

The Assurance Server Xsight includes the following key features:

Root cause analysis

Impact analysis

Auto-discovery of multi-vendor and multi-technology networks

Standards-based integrations

One concurrent administrator license (fault-tolerant license not included)




ASSURANCE SERVER INTEGRITY

SPECTRUM’s roots lie in serving large enterprise customers whose businesses change,

merge, and scale rapidly. The SPECTRUM Integrity solution has transformed CA’s patented

technologies and combined them with new features and functionality to offer today’s

evolving enterprises the power to manage business-critical services across the hall or

around the world. This component provides support for most vendor devices found in today’s enterprise networks.

The Assurance Server Integrity includes the following key features:

Root cause analysis

Impact analysis



One concurrent administrator license and a fault-tolerant license

ASSURANCE SERVER INFINITY

SPECTRUM Infinity is specifically focused on the needs of today’s service providers. It

provides specific functionality with significant performance improvements dedicated to

accelerating new service rollouts and exceeding customer quality expectations, while

allowing them to manage a growing infrastructure with existing resources. This component

provides Integrity device support and the Advanced Management Module pack that provides

support for high-end devices typically found only in service provider networks.

The Assurance Server Infinity includes the following key features:

Root cause analysis

Impact analysis



One concurrent administrator license, a fault-tolerant license, and two Southbound

Gateway integration licenses

OneClick

SPECTRUM OneClick is a three-tier, web-based console. The central component is a web

server that connects directly to SPECTRUM Assurance Servers and delivers information to

distributed Java clients. The feature-rich Java clients are downloaded, installed, and

updated from the OneClick web server to ease implementation, administration, and

maintenance. The SPECTRUM OneClick console combines the anywhere/anytime access and

reduced training requirements of standard web-based applications with the scalability and responsiveness of a full desktop client application.




The Alarm Notification Manager includes the following key features:

Alarm consolidation

Alarm filtering

Policy-based alarm forwarding

Alarm notification

Frame Relay Manager

SPECTRUM Frame Relay Manager delivers precise monitoring and performance thresholding

of committed information rates (CIR), bandwidth utilization, and circuit congestion. Root

cause analysis and fault isolation are provided per data link connection identifier (DLCI), with

impact analysis to prioritize response and corrective action. Patented intelligent auto-

discovery techniques leverage remote IP address information and traffic statistics to map

DLCI connectivity and present an integrated topology view. Several large enterprises use

SPECTRUM Frame Relay Manager to document SLA violations with their service providers.

SPECTRUM Frame Relay Manager can also determine whether an enterprise has purchased too much or

too little bandwidth on a per-circuit basis. This can result in cost savings of thousands or tens of thousands of dollars per month in WAN connectivity charges. Service providers have

used SPECTRUM Frame Relay Manager to ensure SLA compliance, improve customer

service quality, and deliver differentiated service offerings. Using this component, one

service provider is able to identify Frame Relay problems in 97 to 99% of cases before its

customers do, and is working to fix them before the customer’s business is impacted.

This component provides a cost-effective way to improve service quality, deliver end-to-end

visibility, and reduce operating costs.

The Frame Relay Manager includes the following key features:

Proactive communication with Frame Relay equipment that supports RFC 1315 or RFC

2115 Frame Relay MIBs with vendor extensions for Cisco and Nortel

Fast, accurate modeling of physical and logical DLCI port connectivity with IP address,

subnet mask, and remote IP address information

Out-of-box performance views show CIR throughput, congestion statistics, and data

terminal equipment (DTE) changes
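The per-circuit sizing check described above can be illustrated with a short calculation. The following Python sketch is not SPECTRUM code: the `CircuitSample` type, its field names, and the 30% and 90% thresholds are assumptions chosen for illustration. It converts an octet count collected over a polling interval into average throughput and compares the result with the circuit's CIR.

```python
from dataclasses import dataclass


@dataclass
class CircuitSample:
    """One polling interval for one Frame Relay circuit (hypothetical type)."""
    dlci: int          # data link connection identifier
    cir_bps: float     # committed information rate, in bits per second
    octets: int        # octets transferred during the interval
    interval_s: float  # length of the polling interval, in seconds


def cir_utilization(sample: CircuitSample) -> float:
    """Return average throughput as a fraction of the circuit's CIR."""
    throughput_bps = sample.octets * 8 / sample.interval_s
    return throughput_bps / sample.cir_bps


def sizing_hint(utilization: float, low: float = 0.3, high: float = 0.9) -> str:
    """Classify a circuit using illustrative (assumed) thresholds."""
    if utilization < low:
        return "over-provisioned"
    if utilization > high:
        return "under-provisioned"
    return "adequately sized"
```

In practice a sizing decision would rest on many intervals, including busy-hour peaks, rather than on a single sample.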

ATM Circuit Manager

SPECTRUM ATM Circuit Manager delivers precise monitoring and performance thresholding

of ATM throughput, bandwidth utilization, and circuit congestion. Root cause analysis and

fault isolation are provided per virtual path link (VPL)/virtual channel link (VCL), with

impact analysis to prioritize response and corrective action. Patented intelligent auto-

discovery techniques leverage remote IP address information and traffic statistics to map

virtual path identifier (VPI)/virtual channel identifier (VCI) connectivity and present an integrated topology view. The ATM circuit path view displays the endpoint-to-endpoint

mapping for each device, physical port, and logical interface traversed.

Enterprises can also import a list of permanent virtual circuits (PVCs) provided by their

service provider to accurately model all ATM WAN links. Several large enterprises already

use SPECTRUM ATM Circuit Manager to document SLA violations with their service provider.

In addition, SPECTRUM ATM Circuit Manager can determine whether an enterprise has

purchased too much or too little bandwidth on a per-circuit basis. This can result in cost

savings of several thousand dollars per month in WAN connectivity charges. Service

providers have used SPECTRUM ATM Circuit Manager to ensure SLA compliance and deliver

differentiated service offerings. This component provides a cost-effective way to improve

service quality, deliver end-to-end visibility, and reduce operating costs.

The ATM Circuit Manager includes the following key features:

Proactive communication with ATM equipment that supports RFC 1695 with private

management information bases (MIBs)

Fast, accurate modeling of physical and logical VPL/VCL port connectivity with IP address,

subnet mask, and remote IP address information

Out-of-box performance views showing cells-per-second throughput and ATM QoS

information
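To relate a cells-per-second figure to link bandwidth, recall that an ATM cell is 53 bytes long, of which 48 bytes are payload. The following Python sketch shows the conversion; the function names and the line-rate parameter are assumptions for illustration and are not part of SPECTRUM.

```python
ATM_CELL_BYTES = 53     # 5-byte header plus 48-byte payload
ATM_PAYLOAD_BYTES = 48


def cell_rate_to_bps(cells_per_second: float, payload_only: bool = False) -> float:
    """Convert a cell rate to bits per second (whole cells or payload only)."""
    size = ATM_PAYLOAD_BYTES if payload_only else ATM_CELL_BYTES
    return cells_per_second * size * 8


def link_utilization(cells_per_second: float, line_rate_bps: float) -> float:
    """Return the fraction of a link's line rate consumed by the cell stream."""
    return cell_rate_to_bps(cells_per_second) / line_rate_bps
```

The payload-only figure is the better measure of useful throughput, because the 5-byte header of every cell is overhead.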

Multicast Manager

SPECTRUM Multicast Manager provides multi-vendor visibility into logical multicast network

sessions — proactively monitoring key performance indicators while highlighting the impact

of infrastructure outages on multicast services. All logical multicast overlay services are

automatically discovered and modeled within the SPECTRUM Assurance Server. Multicast

session models maintain complete knowledge of the multicast feed including its source,

distribution tree, and receivers.

SPECTRUM Multicast Manager presents the user with an easy-to-use interface for topology

navigation and alarm monitoring. This results in lower training and administration costs as

users have at-a-glance access to actionable information. Multicast enhancements allow the

user to view the per-group multicast topology and the associated routers, switches, and

ports that comprise the IP multicast group. SPECTRUM Multicast Manager also monitors

multicast group health. If a resource in a multicast group (source, routers, switches, ports)

experiences a reliability problem, SPECTRUM Multicast Manager will automatically

understand the impact on the overall group. This component provides a cost-effective way

to manage your multicast infrastructure as a business service.

The Multicast Manager includes the following key features:

Multi-vendor view of IP services with a detailed understanding of the elements that

comprise a multicast group

An intuitive interface for multicast topology navigation and alarm monitoring of groups,

sources, receivers, and rendezvous point (RP) devices
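The group-health idea described above, in which a failing resource affects every multicast group that depends on it, can be sketched as a simple membership lookup. This is an illustration only and does not reflect how SPECTRUM models multicast sessions; the group addresses and resource names are hypothetical.

```python
from typing import Dict, List, Set


def impacted_groups(groups: Dict[str, Set[str]], failed_resource: str) -> List[str]:
    """Return the multicast groups that include the failed resource."""
    return sorted(
        name for name, resources in groups.items() if failed_resource in resources
    )


# Hypothetical inventory: group address -> source, routers, and switches used
example_groups = {
    "239.1.1.1": {"src-a", "rtr-1", "rtr-2", "sw-1"},
    "239.1.1.2": {"src-b", "rtr-2", "sw-2"},
}
```

With this inventory, a failure of `rtr-2` affects both groups, while a failure of `sw-1` affects only the first.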

QoS Manager

The SPECTRUM QoS Manager enables enterprises and service providers to verify and

validate the configuration and effectiveness of QoS Policies and Traffic Classes throughout

the IT infrastructure. Technology Relationship Mapping and web-based reporting discover and document the health and performance for each class of service (CoS) configured across the network.

Patented SPECTRUM analytics intelligently integrate and automate modeling of your QoS

Policies and Traffic Classes to deliver RCA and impact prioritization.





Today’s complex and distributed networks are increasingly governed by security policies that prohibit

the use of insecure management protocols such as Simple Network Management Protocol

(SNMP) v1 or Internet Control Message Protocol (ICMP) for management of those networks.

For example, a Demilitarized Zone (DMZ) separates a set of elements from the intranet

through a firewall for security purposes. Still, the businesses, processes, and customers need to be supported by IT services. This requires visibility into the complete infrastructure.

SPECTRUM Secure Domain Manager (SDM) enables customers to manage those domains by

securely tunneling SNMP and ICMP traffic through a secure sockets layer (SSL) connection.

Only a single port needs to be opened in the firewall, allowing for extended

manageability without impacting security policies in place. This solution is fully

transparent to the end user and all client applications, eliminating the need to perform

additional administrative tasks.

Note: This feature is not available for the Assurance Server Xsight.

The Secure Domain Manager includes the following key features:

Multiple secure domain connectors

SNMP and ICMP traffic forwarding

Securely tunneled traffic via XML/SSL over Transmission Control Protocol (TCP)

Transparency to users and client applications
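Tunneling datagram traffic such as SNMP or ICMP over a single TCP connection requires some way to mark where each message begins and ends in the byte stream. The sketch below illustrates the general technique with a length-prefixed binary header. It is an assumption for illustration and does not reproduce the Secure Domain Manager wire format, which this document describes as XML/SSL over TCP. Encryption is omitted; in a real tunnel the framed bytes would be sent over an SSL/TLS-wrapped socket.

```python
import struct
from typing import List, Tuple

# Hypothetical frame header: 1-byte traffic type, 4-byte payload length
HEADER = struct.Struct("!BI")
TYPE_SNMP = 1
TYPE_ICMP = 2


def frame(kind: int, payload: bytes) -> bytes:
    """Prefix a datagram with its type and length for transport over TCP."""
    return HEADER.pack(kind, len(payload)) + payload


def unframe(stream: bytes) -> Tuple[List[Tuple[int, bytes]], bytes]:
    """Split a byte stream into complete (type, payload) messages.

    Returns the decoded messages and any trailing bytes that do not yet
    form a complete frame.
    """
    messages = []
    offset = 0
    while len(stream) - offset >= HEADER.size:
        kind, length = HEADER.unpack_from(stream, offset)
        end = offset + HEADER.size + length
        if end > len(stream):
            break  # payload not fully received yet
        messages.append((kind, stream[offset + HEADER.size:end]))
        offset = end
    return messages, stream[offset:]
```

Because every message travels over the same connection, the firewall needs to permit only that one TCP port.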

Configuration Manager

Managing today’s complex infrastructures involves maintaining hundreds or thousands of

business-critical devices. Being able to keep track of how they are all configured — and

making sure that configurations are accurate — can be overwhelming. SPECTRUM

Configuration Manager is an intelligent, integrated application that automates management

of critical device configurations to keep your business operational. SPECTRUM Configuration

Manager provides the tools that you need to capture, modify, load, and verify configurations for thousands of multi-vendor devices. With its unique design, SPECTRUM

Configuration Manager allows users to perform device administration on configuration files,

MIB object identifiers (OIDs), and SNMP attributes. Each configuration is time-stamped and

identified by the revision number. SPECTRUM-specific values such as polling interval,

community name, or security string can be edited.

SPECTRUM Configuration Manager can quickly load any stored configuration to single or

multiple devices simultaneously — tracking all changes, scheduling automatic uploads

during maintenance windows, or rolling back configurations to their last known good state.

Automatically scheduled configuration comparisons deliver immediate notification of

unauthorized changes. This component provides cost-effective configuration management

to ensure business continuity.
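A configuration comparison of the kind described above amounts to a line-by-line difference between a stored baseline and the configuration currently running on the device. The following Python sketch uses the standard `difflib` module to show the idea. It is not SPECTRUM Configuration Manager code, and the sample configuration lines in the usage note are hypothetical.

```python
import difflib
from typing import List


def config_changes(baseline: str, current: str) -> List[str]:
    """Return the lines removed from ("- ") or added to ("+ ") the baseline."""
    diff = difflib.ndiff(baseline.splitlines(), current.splitlines())
    # ndiff also emits unchanged ("  ") and hint ("? ") lines; keep only changes
    return [line for line in diff if line.startswith(("- ", "+ "))]


def is_unauthorized_change(baseline: str, current: str) -> bool:
    """Treat any difference from the stored baseline as a change to report."""
    return bool(config_changes(baseline, current))
```

For example, comparing a baseline that contains `snmp-server community public RO` with a running configuration that contains `snmp-server community private RW` reports one removed line and one added line, which a scheduler could turn into a notification.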




Voice Management Components

The eHealth for Voice product is comprised of the following components:


eHealth for Voice

Right to Use License (per PBX/message system)

Node License (1 per call/message system)

eHealth for Voice Policy Manager

eHealth for Voice

eHealth for Voice is a multi-vendor, multi-system (call management and voice messaging)

and multi-technology (traditional TDM PBX and IP telephony (IPT)) performance management solution

that greatly simplifies management of voice networks. eHealth for Voice eliminates the

manual collection of data and the labor-intensive effort of report compilation and telephony

grade of service (GoS) determination. This translates to improved voice system performance and availability

delivered at a lower cost. Furthermore, eHealth for Voice is an agent-less solution that does

not require any software to be installed on the voice systems, simplifying installation and

greatly reducing time-to-value.
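Grade of service for a trunk group is commonly expressed as the probability that a call is blocked. This document does not state which traffic model eHealth for Voice applies, so the following Python sketch uses the standard Erlang B formula purely as an illustration of what a GoS calculation involves. The function names are assumptions.

```python
def erlang_b(traffic_erlangs: float, trunks: int) -> float:
    """Return the Erlang B blocking probability for the offered traffic."""
    blocking = 1.0  # with zero trunks, every call is blocked
    for n in range(1, trunks + 1):
        # Standard recurrence: B(n) = A*B(n-1) / (n + A*B(n-1))
        blocking = traffic_erlangs * blocking / (n + traffic_erlangs * blocking)
    return blocking


def trunks_needed(traffic_erlangs: float, target_gos: float) -> int:
    """Return the smallest trunk count whose blocking meets the target GoS."""
    if not 0 < target_gos <= 1:
        raise ValueError("target_gos must be in the range (0, 1]")
    trunks = 0
    while erlang_b(traffic_erlangs, trunks) > target_gos:
        trunks += 1
    return trunks
```

For example, 2 erlangs of offered traffic on 2 trunks gives a blocking probability of 0.4, and 5 trunks are needed to bring blocking below 5%.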

You can run a wide variety of reports for delivery to printers, email recipients, or a

corporate intranet. With eHealth for Voice, accurate current and historical system

information is always available for trending and analysis. Instead of fragmented data

snapshots, true performance measurements are delivered to a desktop or printer, every

day, automatically.

The eHealth for Voice architecture allows maximum scalability by modularizing functions.

For smaller installations, a single server may contain the database and the data collection

module. For larger applications, any number of additional servers in different locations may

act as data collection agents, downloading data from clusters and sending it to the one

central database. Data can be collected according to a user-defined schedule, twenty-four

hours a day, seven days a week, so that all data is retrieved before it is overwritten. The

central database may be accessed by any number of client machines over an IP network to

provide access to data and reports.

You can purchase eHealth for Voice by ordering the eHealth for Voice Right to Use license

for the PBX, call system, or messaging system to be monitored (for example, purchase CA

eHealth for Voice – Nortel CS-1000 and Meridian to monitor Nortel PBXs) and then order the

appropriate number of node licenses. One node license is required for each call system or

messaging system monitored.

eHealth for Voice Policy Manager

eHealth for Voice Policy Manager is a component that plugs into the eHealth for Voice

engine to monitor all data activity against user-defined criteria and provide automatic

notification when those criteria are met. The module allows you to set specific thresholds

and conditions at the node, platform, or system-wide level and to set notification actions

including sending e-mail, console, and pager messages, SNMP traps to SPECTRUM,

Unicenter NSM, or third-party monitoring systems, and invoking customized commands and




Technical Specifications for SPECTRUM Report Manager and OneClick Servers

Minimum System Requirements

UNIX/Linux: Sun SPARCstation; Linux: Pentium Xeon

Windows: Pentium Xeon

Operating Systems

UNIX/Linux: Solaris 9, 10 (see installation guide for required patches); Linux Red Hat Ver 3, update 6 or greater

Windows: Windows 2000, Windows XP Professional, or Windows 2003 Server. Note: Business Objects XI supports a maximum of 10 users on XP.

Memory

UNIX/Linux: 1 GB (with 2 GB swap space for Solaris)

Windows: 1 GB

Free Disk Space

UNIX/Linux: 4 GB

Windows: 4 GB

Applications

UNIX/Linux: Linux Update 6 or greater; Solaris packages SUNWeu8os and SUNWeuluf

Windows: See Microsoft Support for information about updates for your Windows version. Business Objects XI Service Pack 1

Technical Specifications for SPECTRUM OneClick Servers (without Report Manager)

Minimum System Requirements

UNIX/Linux: Sun SPARCstation; Linux: Pentium Xeon

Windows: Pentium Xeon

Operating Systems

UNIX/Linux: Solaris 9, 10 (see installation guide for required patches); Linux Red Hat Ver 3, update 6 or greater

Windows: Windows 2000, Windows XP Professional, or Windows 2003 Server

Memory

UNIX/Linux: 1 GB

Windows: 1 GB

Free Disk Space

UNIX/Linux: 230 MB

Windows: 230 MB

Applications

UNIX/Linux: Linux Update 6 or greater; Java 2 SDK, Standard Edition, version 1.5.0_06 or later

Windows: Windows 2000 Service Pack 2 or later; Java 2 SDK, Standard Edition, version 1.5.0_06 or later




Installation Steps

You can install the software applications in any order; as a best practice, install SPECTRUM

and SPECTRUM OneClick/Report Manager first, then eHealth, and, finally, eHealth for Voice.

Important: The CA network and voice management applications require you to use

systems that are dedicated to each application. Do not use those systems for other

applications or services. While anti-virus and security software are recommended for any

server system in your environment, disable the anti-virus software during installation to

ensure that the applications install completely.

You need a minimum of four systems for the following basic configuration:

SPECTRUM (SpectroSERVER system)

SPECTRUM OneClick and Report Manager server

eHealth

eHealth for Voice

BEST PRACTICES

To facilitate the successful installation and setup of these components, review the following

best practices:

Ensure that the systems on which you plan to install the software have fixed IP

addresses.

Obtain and test login account privileges to the systems. For Windows systems, you need

an account with Administrator privileges. For UNIX systems, you need access to the root

user account.

How You Install SPECTRUM

On the system that you have designated as the SpectroSERVER for your environment,

install SPECTRUM Release 8.0. Log on to http://support.concord.com to obtain the latest Service Pack for the release from the Software Downloads page.

Follow the instructions in the SPECTRUM Installation Guide to complete the following tasks:

1. Confirm SPECTRUM prerequisites.

2. Prepare the operating system and optimize the system for best performance.

3. Make sure that you have SPECTRUM license and extraction keys. You obtain the keys

from your CA sales representative when you purchase the software.

4. Install the software and perform any necessary troubleshooting for the installation.

5. Start the SPECTRUM software.

6. Enable access to the SPECTRUM system.

Following the SPECTRUM installation, proceed to the SPECTRUM OneClick installation.

Install OneClick so that you have full administrative access to SPECTRUM.

Note: OneClick is the primary administration interface to SPECTRUM. Use OneClick, rather

than the legacy SpectroGRAPH interface, to perform administrative functions.




How You Install SPECTRUM OneClick and Report Manager

On the system that you have designated as the OneClick and Report Manager server for

your environment, install SPECTRUM OneClick and Report Manager for Release 8.0.

Follow the steps in the Report Manager Installation and Administration Guide to complete

the following tasks:

Important: To install the OneClick and Business Objects software, follow the

documentation carefully. Installation failures typically result if you diverge from the

documented steps. Install Business Objects first and select the option to use an existing

Java application server. In the event of a failure, you will have to remove and reinstall

OneClick and Report Manager.

1. Confirm OneClick prerequisites and system requirements.

2. Prepare the operating system and optimize the system parameters for best

performance.


4. Install the OneClick client to run the application and confirm that you can connect to

the SpectroSERVER system.

5. On the OneClick interface, click the Report Manager tab on the OneClick index page to

confirm that Report Manager installed correctly.

How You Install eHealth

On the system that you have designated as the eHealth server for your environment, install

eHealth Release 6.0. Make sure that you log on to http://support.concord.com to obtain the

latest InstallPlus kit for the release from the Software Downloads page.

Follow the instructions in the New Installations of eHealth 6.0 Guide for your system

platform (Windows or UNIX) to complete the following tasks:

1. Confirm system prerequisites and locations for the eHealth and embedded Oracle

software.

2. Install the eHealth and Oracle software, and perform any necessary troubleshooting for

the installation.

3. Make sure that you have the eHealth licenses for the features that you will be using.

You obtain these licenses with the purchase of the eHealth products. For eHealth

release 6.0 GA, note that you need an eHealth SPECTRUM Integration license to

configure and use the integrated solution.

Important: After you complete the eHealth installation, follow the instructions provided in

the section “Configuring the Integrated Solution” in this chapter. Do not follow the

instructions to start the eHealth console and begin discovering your resources as elements.

For the integrated solution, you will discover eHealth elements by importing the SPECTRUM

configuration from the SpectroSERVER system. This simplifies the administration tasks for

eHealth discovery. For a description of the eHealth administration tasks and interfaces, see

the eHealth Administration Overview Guide.




How You Install eHealth for Voice

On the system that you have designated as the eHealth for Voice server for your

environment, install eHealth for Voice Release 4. You can install eHealth for Voice in its

entirety on one PC. You can also install the software in a distributed configuration on

several PCs where one is the database server and the others are client systems that can

access the database server for reports and administration tasks.

The Database Manager server requires a Microsoft SQL Server database engine. You must

purchase and install Microsoft SQL Server before installing eHealth for Voice. You can

typically accomplish this by installing Microsoft SQL Server 2000 on the PC that will hold

the eHealth for Voice database.

Note: Microsoft SQL Server is required only on the PC that will contain the eHealth for

Voice database (the Database Manager installation); it is not required on agent-only or

client-only machines.

Follow the instructions provided in the eHealth for Voice Operations Guide to complete the

following tasks:

1. Confirm system prerequisites.


3. Optionally, install client-only servers to access the eHealth for Voice database server.

4. Start the Program Console and define your voice environment:

a. Install the licenses for the platforms to be supported.

b. Start the following services (at minimum):

› Task Scheduler

› Data Collector

› Data Loader

› Policy Manager

c. Define the following:

› Company

› Group

› Collector

› Platform

d. Check the Data Collection queue to verify the scheduled data collection.

5. Set up the SPECTRUM integration by following the instructions provided in the eHealth

for Voice Integration for SPECTRUM Guide.

When you complete the eHealth for Voice installation, follow the instructions to start the

Program Console and define your voice environment; then set up your SPECTRUM

integration by following the instructions provided in the eHealth for Voice Integration for

SPECTRUM Guide.




Configuring the Integrated Solution

Using SPECTRUM, SPECTRUM OneClick, eHealth, and eHealth for Voice, you can deploy an

integrated solution for managing your network and voice resources.

SPECTRUM provides the top-level management interface for resource identification, fault

management, and IT network problem resolution. SPECTRUM reduces alarm noise and

detects root causes of problems.

eHealth provides performance management by collecting detailed statistics on your

resources and analyzing that data to detect growing problems and changes in behavior.

eHealth Live Health compares performance to thresholds and service rules, and raises

alarms when resource performance starts to degrade. Health reports and Live Health can

send alarms (traps) to SPECTRUM to reflect these problems in OneClick views.

eHealth for Voice manages the end-to-end service for traditional voice networks as well

as Voice over IP converged networks. It can detect service policy violations and capacity

problems, and send alarms to SPECTRUM to alert network managers through their

OneClick views.

While these products can be used separately to manage and report on network

performance and faults, their combined capabilities provide network managers with a single

top-level view of possible problems and changes in network performance, and the

capabilities to drill down to reports for more information and troubleshooting.

Best Practices

The following sections describe the best practices for configuring CA’s integrated Network

and Voice Management solution. These practices streamline common administration tasks,

and reduce time devoted to managing and maintaining the software configurations.

To configure the integrated solution, follow these primary steps:

1. Identify the network resources that you want to manage using SPECTRUM discovery;

then create Global Collections to organize those resources.

2. Import the SPECTRUM-discovered resources into eHealth using eHealth’s discover

process.

3. Facilitate reporting and management of your resources by organizing related elements

into groups and group lists based on the relationships such as the geographic region,

customer, organization, or department that they support.

4. Schedule eHealth discoveries of Global Collections to maintain the poller configuration.

The following sections describe these steps in more detail, and provide references to

product documentation that provides complete information.

Identify Resources and Use SPECTRUM to Discover Them as Global

Collections

Use SPECTRUM discovery to identify the network resources that you want to manage, and

then create Global Collections to organize those resources into topology views. These views

help network operators track various collections of network entities, organizations, or

services that comprise your infrastructure.




Health can detect changes in behavior, identify potential problems in service degradation or

capacity, and provide insight into performance trends over time.

SET UP THE SPECTRUM INTEGRATION

Before you can import global collections to eHealth, run the SPECTRUM setup program on

the eHealth system.

To run the eHealth SPECTRUM Integration setup program

1. Log in to the eHealth system as the eHealth administrator.

2. Open a terminal window and change to the eHealth directory by entering the following

command, where ehealth is the full pathname:

cd ehealth

3. Run the setup program by entering the following command:

./bin/nhSpectrumSetup

The SPECTRUM Import Setup dialog box opens.

4. Enter the following information when prompted by the setup program:

› Hostname or IP address of the SPECTRUM OneClick server

› Port number for OneClick server Web requests

› Path where OneClick is installed on the server

› Username used to log in to the OneClick server

› Password for the specified user name

5. Click OK. eHealth verifies your settings and displays a message notifying you if they are

valid.

Note: The validation process may take a few seconds.
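Before running the setup program, it can help to gather the five values it prompts for and check them for obvious mistakes. The following Python sketch shows one way to do that. The `OneClickSettings` class and its field names are hypothetical and are not part of eHealth; the setup program performs its own validation against the OneClick server, which this sketch does not attempt to replicate.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class OneClickSettings:
    """Values requested by the setup dialog (hypothetical field names)."""
    host: str          # hostname or IP address of the OneClick server
    port: int          # port number for OneClick server Web requests
    install_path: str  # path where OneClick is installed on the server
    username: str      # user name used to log in to the OneClick server
    password: str      # password for the specified user name

    def problems(self) -> List[str]:
        """Return a list of obvious input errors; an empty list means none."""
        found = []
        if not self.host.strip():
            found.append("host is empty")
        if not 1 <= self.port <= 65535:
            found.append("port must be between 1 and 65535")
        if not self.install_path.strip():
            found.append("install path is empty")
        if not self.username.strip():
            found.append("username is empty")
        if not self.password:
            found.append("password is empty")
        return found
```

A check like this catches only local input errors. Whether the server is reachable and the credentials are accepted is confirmed by the setup program itself in step 5.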

DISCOVER SPECTRUM GLOBAL COLLECTIONS

Use the eHealth discover process to import the SPECTRUM configuration into eHealth.

To discover a SPECTRUM Global Collection

1. Log in to the eHealth console.

2. Select Setup, Discover.

3. In the Discover dialog, do the following:

a. In the Mode list, select the technology types associated with the resources that you

want to discover.

b. Select SPECTRUM Import and specify the SPECTRUM Global Collection that you

want to import.

c. Click Discover.




eHealth connects to the OneClick server, extracts the information from the SPECTRUM

collection, and discovers the appropriate elements.

4. Save the discovered elements to the poller configuration. eHealth automatically begins

polling them to collect performance data.

Organize Your Resources by Creating eHealth Groups

eHealth provides a grouping capability that helps you to organize your elements effectively,

facilitate administration, and simplify reporting. By focusing on a subset of elements —

rather than all elements in your infrastructure — you can manage them more easily as well

as create effective reports that address specific needs. To manage your infrastructure, you

can organize related elements into groups based on geographic regions, customers,

organizations, or departments that they support. To organize your groups, you can

associate them to group lists.

For example, if you wanted to monitor the systems supporting your business within Europe,

you could create a group called England (composed of resources that support offices in that

country), and other groups for each country in which you operate. You could then add those

groups to a group list called Europe and generate reports for the entire group list. To

simplify reporting and administration, you can also filter your element lists based on your

grouping strategy. Before grouping your resources, review the eHealth best practices for

grouping outlined in the eHealth Element and Poller Management Guide.
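The relationship between elements, groups, and group lists in the Europe example can be pictured as two levels of mapping. The following Python sketch is a conceptual model only, not the eHealth data model; the element names and the Germany and Spain groups are hypothetical.

```python
from typing import Dict, List, Set

# Hypothetical groups: group name -> elements that support that country
groups: Dict[str, Set[str]] = {
    "England": {"lon-rtr-01", "lon-sw-01"},
    "Germany": {"ber-rtr-01"},
    "Spain": {"mad-rtr-01", "mad-sw-01"},
}

# A group list names the groups it contains
group_lists: Dict[str, List[str]] = {
    "Europe": ["England", "Germany", "Spain"],
}


def elements_in_group_list(name: str) -> Set[str]:
    """Return every element covered by a report on the given group list."""
    members: Set[str] = set()
    for group_name in group_lists[name]:
        members |= groups[group_name]
    return members
```

A report on the Europe group list therefore covers the union of the elements in its member groups, without any element being assigned to the group list directly.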

To create a new group

1. Log in to the OneClick for eHealth console as an administrator who has permission to

manage groups.

2. Select Find Elements in the Managed Resources folder.

3. Select the elements that you want to include. Select Element Chooser to filter the list.

Include a wildcard such as an asterisk (*) to match any number of characters, or a question mark (?)

to match a single character.

4. Right-click and select Create Group with Selected Elements.

5. Specify the first group name and a description. If SmartTree is enabled, append a label

to the group name that reflects the location of the elements and use the selected

delimiter (for example: England-1, Germany-1, or Spain-1).

6. Click OK. The group immediately appears under By Group.

7. Repeat Steps 2 through 6 to create other groups with a suffix. For example: England-2,

Germany-2, Spain-2.

8. Under Managed Resources, select By Group. If SmartTree is enabled, the element tree

displays two separate tiers in an alphabetical hierarchy based on that naming

convention.
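Two ideas from the procedure above lend themselves to a short illustration: wildcard filtering of element names, and a two-tier view derived from a delimiter in the group name. The following Python sketch uses the standard `fnmatch` module, whose `*` and `?` wildcards behave as described in step 3. It is an approximation for illustration; the exact matching rules of the Element Chooser and the SmartTree display logic may differ, and the element names in the usage note are hypothetical.

```python
import fnmatch
from typing import Dict, Iterable, List


def filter_elements(names: Iterable[str], pattern: str) -> List[str]:
    """Keep names matching the pattern ('*' = any run, '?' = one character)."""
    return [name for name in names if fnmatch.fnmatchcase(name, pattern)]


def group_tiers(group_names: Iterable[str], delimiter: str = "-") -> Dict[str, List[str]]:
    """Arrange group names into two tiers using the text before the delimiter."""
    tiers: Dict[str, List[str]] = {}
    for name in sorted(group_names):
        top, _, _ = name.partition(delimiter)
        tiers.setdefault(top, []).append(name)
    return tiers
```

For example, the pattern `lon-rtr-0?` selects `lon-rtr-01` and `lon-rtr-02` but not `par-sw-01`, and the group names England-1, England-2, and Spain-1 are arranged under the top-level entries England and Spain.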




Network and Voice Monitoring

Using CA’s integrated solution, you can closely monitor the performance of your network

and voice resources. The eHealth Live Health application provides instantaneous feedback

on trouble spots — telling you where the problems are, when they started, and their

severity. It also identifies growing problems before they become failures, allowing you to

take action and keep your business running smoothly. Live Exceptions sends alarm

notifications to the SPECTRUM interface. You can then run reports from the SPECTRUM

OneClick interface to review eHealth’s analysis of the problems.

To configure the integrated solution to monitor performance, follow these primary steps:

1. Set up Live Health monitoring of eHealth groups and group lists.

2. Forward Live Health traps to SPECTRUM.

3. Customize and schedule Health reports to send traps to SPECTRUM.

4. Configure eHealth for Voice to send alerts to SPECTRUM.

5. Configure SPECTRUM to recognize the eHealth server.

6. Configure SPECTRUM to view eHealth alarms.

The following sections describe these steps in more detail, and reference the product

documentation for complete information.

Set Up Live Health

After you discover the resources that you want to monitor and group them, you can

associate them to a Live Health profile to indicate when performance problems are

occurring. A Live Health profile is a set of alarm rules that eHealth applies to groups or

group lists of elements. Alarm rules define the types of elements and conditions to monitor,

the problem thresholds and duration, and the problem severity.
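The parts of an alarm rule named above (the condition to monitor, a threshold, a duration, and a severity) can be made concrete with a small model. The following Python sketch is an assumption for illustration and does not reflect the actual Live Exceptions rule engine, which may evaluate duration differently (for example, as time over threshold within a window). The `AlarmRule` class and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class AlarmRule:
    """A simplified alarm rule (hypothetical structure)."""
    metric: str            # condition being monitored, such as utilization
    threshold: float       # value above which the condition is a problem
    duration_samples: int  # consecutive samples the problem must persist
    severity: str          # severity assigned to the resulting alarm


def is_triggered(rule: AlarmRule, samples: Iterable[float]) -> bool:
    """Return True if the threshold is exceeded for enough consecutive samples."""
    run = 0
    for value in samples:
        run = run + 1 if value > rule.threshold else 0
        if run >= rule.duration_samples:
            return True
    return False
```

Requiring the condition to persist for several samples keeps short spikes from raising alarms, which is the purpose of the duration part of a rule.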

eHealth provides hundreds of technology-specific profiles for managing your network

resources. For each technology, eHealth offers the following types of Live Health profiles:

Profile Name: Description of Purpose

Failure: Identifies problems with availability, errors, or other device failures.

Delay: Warns of overutilization or congestion problems that could cause network delay.

Unusual workload: Indicates when an element’s capacity or volume is outside its typical performance for the baseline.

Latency: Identifies when network latency is increasing. Latency is usually measured between the eHealth system and the device itself.

Configuration change: Detects when a device’s configuration has changed, such as module/card insertions to a switch.

Security: Warns of problems such as a firewall detecting a “ping of death” attack, login failures, or unauthorized accesses.

Once you assign a profile to a group or group list of elements, Live Exceptions monitors the

group or group list to look for any activity that violates the specified rules, and produces

alarms when activity triggers any of the rules in the profile. With this integrated solution,

you can configure Live Health to send alerts to the SPECTRUM interface when problems

occur. Carefully review the Live Exceptions web help available with the product to ensure

that you understand performance and how the rules identify performance problems.

FIND LIVE EXCEPTIONS PROFILES

The Live Exceptions feature has hundreds of default profiles that you can use to monitor

your resources. To search and review the profiles that apply to your types of resources, use

the Live Health Profiles tool on the eHealth Certification support site.

To review available Live Exceptions profiles

1. Using a web browser, log on to http://support.concord.com.

2. Click Certification.

3. On the Certification page, click Live Health Profile Descriptions under Certification

Information.

4. Click Element Types to display the various types of resources that eHealth can monitor.

5. Scroll through the list of elements to locate the types that you are currently monitoring.

For example, if you are monitoring CPUs, click CPU, Router/Switch CPU(1), Generic

Router/Switch CPU (2).

6. Click a profile name, and review the profile description to determine the types of

problems for which it will raise alarms.

You can also create custom profiles and rules. For a description of how to create rules and

profiles, see the Live Exceptions web help.

START LIVE EXCEPTIONS AND ASSOCIATE PROFILES

To use Live Exceptions, you must log in to the eHealth Web interface and download the Live

Health client application to your local PC or workstation. Install the Live Exceptions client

following the instructions provided on the download page.

To access the eHealth web interface, use a web browser to navigate to

http://hostname:port, where hostname is the name or IP address of your eHealth system,

and port is the HTTP port used by the web server. If your Web server uses the default port

80, you can omit the port number. You must have an eHealth Web user account to log on

to the eHealth web interface.
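
The URL rule above (include the port unless the web server uses the default port 80) can be expressed as a short Python helper. The function name and the host name `ehealth.example.com` are placeholders chosen for this sketch, not values from the product.

```python
def ehealth_url(hostname: str, port: int = 80) -> str:
    """Build the eHealth web interface URL, omitting the port when it is the default 80."""
    if port == 80:
        return f"http://{hostname}"
    return f"http://{hostname}:{port}"


print(ehealth_url("ehealth.example.com"))        # default port is omitted
print(ehealth_url("ehealth.example.com", 8080))  # non-default port is appended
```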




4. Specify the following information for the SpectroSERVER under Edit Trap Destination:

› Hostname

› IP address

› Port number

5. Click Add.

6. Confirm that the name of the SpectroSERVER appears in the Existing Trap Destinations

list; then click OK.

7. Select Setup, Notifier Rules.

8. In the Notifier Manager dialog, click New.

9. In the Notifier Rule Editor dialog, do the following:

a. In the Name field, enter SPECTRUM.

b. In the Action list, select Send Trap.

c. In the To NMS list, select the SpectroSERVER that you specified in Step 4.

d. Under When an alarm is, select both Raised and Cleared.

e. Under Elements within, specify either a specific technology type or All

Tech/Subjects.

f. Click OK to save your Notifier rule.

10. Confirm that the Notifier rule appears; then, close the window.
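
The Notifier rule configured in Step 9 can be thought of as a filter: it sends a trap only when the alarm state and the technology of the element match the rule. The following Python sketch models that decision under stated assumptions. The `NotifierRule` class, its field names, and the destination host name are hypothetical; only the option labels (Send Trap, Raised, Cleared, All Tech/Subjects) come from the procedure above.

```python
from dataclasses import dataclass

ALL_TECH = "All Tech/Subjects"


@dataclass(frozen=True)
class NotifierRule:
    """Hypothetical model of the fields set in the Notifier Rule Editor."""
    name: str
    action: str
    destination: str
    alarm_states: frozenset
    technology: str = ALL_TECH


def should_send_trap(rule: NotifierRule, alarm_state: str, technology: str) -> bool:
    """Return True when an alarm in this state, for this technology, matches the rule."""
    if rule.action != "Send Trap":
        return False
    if alarm_state not in rule.alarm_states:
        return False
    return rule.technology == ALL_TECH or rule.technology == technology


spectrum_rule = NotifierRule(
    name="SPECTRUM",
    action="Send Trap",
    destination="spectroserver.example.com",  # placeholder host name
    alarm_states=frozenset({"Raised", "Cleared"}),
)

print(should_send_trap(spectrum_rule, "Raised", "Router"))
print(should_send_trap(spectrum_rule, "Updated", "Router"))
```

Selecting both Raised and Cleared matters because SPECTRUM needs the clear notification to close the corresponding alarm; a rule that matched only Raised would leave stale alarms behind.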

Customize and Schedule Health Reports to Forward Traps

A Health report evaluates the health of a group of elements by comparing current

performance to historical performance over the course of a day, week, or month. The report

identifies errors, unusual utilization rates, or shifts in volume that warrant investigation.

This report helps you evaluate the health of your resources by monitoring how efficiently

those resources are running, checking for availability of critical resources, and detecting

whether they are beginning to experience problems. The report analyzes trends based on

historical data and calculates averages using a service profile.

You can configure individual Health reports to forward traps for Health exceptions to the

SpectroSERVER. When a scheduled Health report runs, eHealth sends an SNMP trap to the

SpectroSERVER for the top problem of each element in the Exceptions section of the Health

report.

Note: Only scheduled Health reports forward exceptions. If you manually run a Health

report, it will not forward exceptions.




To forward exceptions from Health reports

1. Log in to the eHealth Console.

2. Select Reports, Customize, Health Reports.

3. Select the report from which you want to forward Health exceptions.

4. In the Presentation Attributes drop-down list, select General.

5. Select NMS IP and Port Trap Address in the Attribute table.

6. In the Value field, specify the SpectroSERVER IP address and SNMP port number,

separated by a colon. For example:

001.02.03.004:162

7. Click Apply to save.

8. In the Presentation Attributes drop-down list, select Exceptions.

9. Select Send Exceptions SNMP Trap in the Attribute table.

10. Select Yes in the value field.

11. Click OK.

12. Click Save to save the custom report.

13. Select Setup, Schedule Jobs.

14. Select Add Health Report from the list.

15. In the Add Scheduled Report dialog, do the following:

a. Select the report.

b. For the subject, select the technology type and group for the report.

c. Specify a time range for the report, and optionally, a time zone.

d. Select the format in which you would like to output the report.

e. Set the schedule for the job.

f. Click OK.

16. Click OK.
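
Step 6 of this procedure expects the SpectroSERVER address and SNMP port as a single value separated by a colon. Before entering the value, it can be checked with a few lines of Python. This is a sketch that assumes an IPv4 address; the function name is hypothetical, and the address `192.0.2.10` is a documentation placeholder (it is used instead of the sample value above because Python's `ipaddress` module may reject octets written with leading zeros).

```python
import ipaddress
from typing import Tuple


def parse_trap_address(value: str) -> Tuple[str, int]:
    """Split an 'address:port' string and validate both parts (IPv4 only)."""
    host, sep, port_text = value.rpartition(":")
    if not sep:
        raise ValueError("expected the form address:port")
    address = ipaddress.IPv4Address(host)  # raises ValueError on a malformed address
    port = int(port_text)                  # raises ValueError if the port is not numeric
    if not 1 <= port <= 65535:
        raise ValueError("port must be between 1 and 65535")
    return str(address), port


print(parse_trap_address("192.0.2.10:162"))
```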

Configure eHealth for Voice to Send Alerts to SPECTRUM

To allow SPECTRUM OneClick to show the voice-specific problems in PBXs, messaging

systems, and other voice infrastructure monitored by eHealth for Voice, configure the

eHealth for Voice Policy Manager to send alerts (SNMP traps) to SPECTRUM when a

particular condition occurs. To configure the Policy Manager, you create a policy based on a

defined action plan (the responses assigned to policies) and conditions.




To configure eHealth for Voice to send alerts to SPECTRUM

1. On the system on which eHealth for Voice is installed, select Start, Programs, eHealth

for Voice, eHealth for Voice. The eHealth for Voice Program Console appears.

2. Select Tools, Service Setup to configure and start the Policy Manager service.

3. Click Configuration, Servers to configure the Email, SNMP, Web, and SPECTRUM

servers.

4. Define the actions to include in the action plan:

a. Click Templates in the Policy Manager group of the console tree.

b. Click Actions.

c. Right-click in the right pane and click New.

d. Complete the details for the action type. Specify information under the Properties

and the Configure tabs.

e. Click Save.

f. Click Cancel to close the window.

5. Create an action plan template to define the responses that you want to assign to the

policy:

a. Click Templates in the Policy Manager group of the console tree.

b. Click Action Plans.

c. Right-click in the right pane and select New.

d. Specify a name, description, time zone, and actions.

e. Click Save.

6. Create a policy based on that action plan:

a. Click Policies in the Policy Manager group of the console tree.

b. Click Global to create a policy based on eHealth for Voice global data.

c. Right-click in the blank area of the right pane, and select New from the menu.

d. Select Blank Policy, and click Next.

e. Specify a name and description.

7. Click Add.




8. Define the condition:

a. Specify the name, platform for the element, and data table.

b. Specify the build criteria.

c. Click Apply to save the condition.

9. Define the policy:

a. Select the time zone, operating interval, and timeframe.

b. Specify the number of times the condition should match the policy before triggering

the action plan.

c. Specify the severity level.

d. Select an action plan.

e. Click Save.
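
Step 9b asks for the number of times the condition should match before the action plan is triggered. The following Python sketch shows one way to reason about that setting. It assumes that matches are counted cumulatively across operating intervals within the timeframe; the product may count differently (for example, consecutive matches only), and the function name is hypothetical.

```python
from typing import Iterable, Optional


def first_trigger(condition_results: Iterable[bool], required_matches: int) -> Optional[int]:
    """Return the 1-based interval at which the match count reaches the required number."""
    count = 0
    for interval, matched in enumerate(condition_results, start=1):
        if matched:
            count += 1
            if count >= required_matches:
                return interval
    return None  # the action plan never triggers in this timeframe


# One boolean per operating interval: did the condition match?
print(first_trigger([False, True, True, False, True], required_matches=3))
```

A higher match count delays the alert but filters out short-lived conditions; a count of 1 sends an alert on the first match.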

Configure SPECTRUM to Recognize the eHealth Server

After completing the eHealth setup, you must also configure SPECTRUM to recognize the

eHealth server. This allows you to drill down to eHealth reports, as well as to clear alarms

from the OneClick console.

To enable SPECTRUM to recognize the eHealth server

1. Log on to the SPECTRUM OneClick homepage using your SPECTRUM credentials and

click Administration at the top of the page.

2. From the Administration menu, select eHealth Configuration.

3. In the eHealth Configuration window, enter the following information:

› Hostname or IP address of the eHealth server.

› Port number on which eHealth listens for web requests.

› eHealth web administrator user name

› eHealth web administrator password

4. Select Started in the Alarm Notifier Status section to enable SPECTRUM to clear Live

Health alarms.

Note: If you configure eHealth to forward alarms to SPECTRUM, and configure

SPECTRUM to view eHealth alarms, the alarm notifier enables you to clear those alarms

directly from the OneClick console.

5. Click Save.




Configure SPECTRUM to View eHealth Alarms

If you configured eHealth to forward Live Health alarms or Health exceptions to a

SpectroSERVER, you must also configure SPECTRUM to receive the alarms.

To enable SPECTRUM to view eHealth alarms

1. Log in as a SPECTRUM administrator.

2. Select Start Console at the top of the OneClick page to launch the OneClick Console.

3. In the Explorer tab of the OneClick Navigation panel, select your SpectroSERVER, and

then select Universe.

Note: If you are monitoring multiple SpectroSERVERs, select Universe under the landscape

for the Trap Director SpectroSERVER.

4. In the Contents panel, select the Topology tab.

5. In the Topology tab toolbar area, click the Create a new model by type icon. The Select

Model Type dialog appears.

6. Select the All Model Types tab.

7. Select EventAdmin, and then click OK. The Create Model of Type dialog appears.

8. Specify the name and IP address of the eHealth server, and click OK. The eHealth

server appears in the topology as an EventAdmin model.

Note: For more information on creating a model in OneClick, see the Modeling Your IT

Infrastructure Administrator Guide.

9. Select the EventAdmin model in the OneClick Topology.

10. Right-click the EventAdmin model; then select Utilities, Attribute Editor. The Attribute

Editor dialog appears.

11. In the Attributes tree, select User Defined and click Add. The Attribute Selector dialog

appears.

12. In the Select Model Type window, select Other, EventAdmin.

13. In the Attributes for EventAdmin window, select

map_traps_to_this_model_using_IP_header, and click OK. The attribute appears in the

User Defined list in the Attribute Editor.

14. Click the arrow that points to the right. The attribute moves to the right window.

15. In the right window, select map_traps_to_this_model_using_IP_header, and select Yes.

16. Click Apply. SPECTRUM applies the attributes to the model, and the Attribute Edit

Results dialog appears.

17. Confirm your changes in the Attribute Edit Results window, and click Close.

18. Click OK in the Attribute Editor.




Chapter 6: Gathering System Information from Agents

Systems are important components of the network. They typically contain your critical

business applications, such as web servers, database applications, e-mail applications, and

other company-critical applications. When their performance degrades, users are unable to

run their applications and perform tasks that utilize those servers. This chapter discusses

how to gather system monitoring information, and outlines best practices for configuring

SPECTRUM and eHealth to manage Unicenter NSM, SystemEDGE, and third-party system

agents.

Deployment and Administration of System Agents

SPECTRUM and eHealth leverage installed system monitoring agents for fault and

performance information. This chapter describes three types of system agents:

CA Unicenter NSM agents

CA SystemEDGE agents

Third-party agents

Important: The installation procedures for the Unicenter NSM and SystemEDGE agents are

described in detail in product-specific installation guides. You must use those guides to

correctly install the agents. After you complete the software installation, review this

chapter to obtain the best practices for configuring SPECTRUM and eHealth to leverage

these agents.

Best Practices

To facilitate the monitoring and management of system agents, follow these best practices:

Ensure that the system agent has been successfully installed.

Configure an SNMP read-only or read-write community string on the system.

Configure the system agent to send traps to the SpectroSERVER.

Confirm that you have specified the correct IP address and community string to discover

the agents.

Confirm that your systems have only one management agent enabled and running on

them. Systems can sometimes have multiple SNMP agents. For example, they could have

the Microsoft SNMP agent and a CA SystemEDGE agent. If multiple agents are running

and responding to SNMP queries, SPECTRUM and eHealth could model both agents for the

one system. For more information, see the Unicenter NSM Agents section later in this

chapter.




Supported Agents

The following table highlights the Unicenter NSM agents supported by eHealth and

SPECTRUM based on release. The Active Directory and Performance agents are supported

only for eHealth reporting.

Unicenter NSM r11 Systems Agents Unicenter NSM 3.1 Systems Agents

UNIX System Agent (caiUxsA2) UNIX System Agent (caiUxOs)

Windows System Agent (caiWinA3) Windows System Agent (caiW2kOs)

Active Directory Services Agent

(caiAdsA2)

Active Directory Services Agent

(caiAdsA2)

Log Agent (caiLogA2) Log Agent (caiLogA2)

Performance Agent (hpxAgent) Performance Agent (hpxAgent)

SPECTRUM and eHealth also support all SystemEDGE agents as well as a variety of third-

party agents provided by vendors such as Microsoft, Dell, Sun, HP, and IBM. In addition,

these applications also support any MIB-II or RFC 2790-compliant agents. These

applications provide out-of-box automated fault management, trap support, and

performance reporting and trending.

Note: For agents that support and use the RFC 2790 extensions of MIB-II, SPECTRUM can

perform process, file system, and log file monitoring in addition to basic host systems

performance monitoring. If you discover agents that do not have the RFC 2790 extensions,

only basic host systems performance and log file monitoring may be possible.
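
The note above distinguishes agents by whether they implement the RFC 2790 (Host Resources MIB) extensions. As a rough illustration, the following Python sketch maps an agent's supported OID subtrees to the monitoring types that the note lists. The OID `1.3.6.1.2.1.25` is the Host Resources MIB subtree; the function name and the idea of passing in a set of supported subtrees are assumptions for the example and do not describe how SPECTRUM detects capabilities.

```python
from typing import Iterable, List

HOST_RESOURCES_MIB = "1.3.6.1.2.1.25"  # Host Resources MIB subtree (RFC 2790)


def monitoring_capabilities(supported_subtrees: Iterable[str]) -> List[str]:
    """List the monitoring types from the note above that apply to an agent."""
    capabilities = ["basic host systems performance", "log file"]
    if HOST_RESOURCES_MIB in set(supported_subtrees):
        capabilities += ["process", "file system"]
    return capabilities


print(monitoring_capabilities({"1.3.6.1.2.1.1"}))                    # MIB-II only
print(monitoring_capabilities({"1.3.6.1.2.1.1", "1.3.6.1.2.1.25"}))  # with RFC 2790
```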

Prerequisites

Before you begin, do the following:

Confirm that you have administrator account access to both the SPECTRUM OneClick

console and the eHealth console.

If you are not familiar with the SPECTRUM OneClick console, see the OneClick

Administration Guide for more information.

If you are not familiar with the eHealth interfaces, review the descriptions of the eHealth

console and the OneClick for eHealth (OneClickEH) console provided in the eHealth

Administration Overview Guide.

How You Add System Agents in SPECTRUM

You can use either of these methods to add the system agents to SPECTRUM:

Automatically discover the system agents using SPECTRUM’s AutoDiscovery application.

Manually add the system agents to SPECTRUM.




AUTOMATICALLY DISCOVER SYSTEM AGENTS

SPECTRUM can automatically discover and model your system resources using auto-

discovery capabilities.

To automatically discover your systems

1. In the SPECTRUM OneClick console, select Tools, Utilities, Discovery, New Discovery.

The Discover dialog appears.




2. In the Discovery window, do the following:

› Specify a configuration name.

› Specify an IP range or list, or select Import to import an IP list file.

› Specify a valid community string. If you specify more than one, OneClick uses

the entry at the top first.

› Select Discover Only in Modeling Options.

› Click Advanced Options and specify port 6665 to discover Unicenter NSM agents.

3. Click Discover. The Discovery dialog appears.




4. (Optional) After the results set appears, exclude entries by right-clicking them and

selecting Exclude. This prevents those devices from being “modeled” in the SPECTRUM

database.

5. Click Model to add the systems to SPECTRUM.




6. Under Model Options, deselect Create Wide Area Link Models and Create LANs. When

these options are deselected, SPECTRUM does not automatically create “subnet”

containers. For more information about discovery and modeling options, see the

Modeling Your IT Infrastructure Administrator Guide.

7. Click OK.

8. After the systems are modeled, click Close in the Discovery dialog.




9. (Optional) Click the paper and pencil icon in the left corner of the Tools menu.

10. Edit the topology by moving icons and add background images, as desired.
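
Step 2 of this procedure accepts an IP range or an imported IP list. When the systems to discover are known only as a range, a short Python sketch such as the following can expand the range into individual addresses for review. The addresses shown are documentation placeholders and the function name is hypothetical. The layout that the Import option expects for an IP list file is not described in this chapter, so confirm it in the product documentation before using output like this as an import file.

```python
import ipaddress
from typing import List


def expand_ip_range(first: str, last: str) -> List[str]:
    """Return every IPv4 address from `first` through `last`, inclusive."""
    start = ipaddress.IPv4Address(first)
    end = ipaddress.IPv4Address(last)
    if end < start:
        raise ValueError("range end precedes range start")
    return [str(ipaddress.IPv4Address(n)) for n in range(int(start), int(end) + 1)]


addresses = expand_ip_range("192.0.2.10", "192.0.2.13")
print("\n".join(addresses))
```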

MANUALLY ADD A DEVICE USING CREATE MODEL BY IP ADDRESS

As an alternative to automatically discovering your systems, you can manually model them.

This procedure works for systems that do not respond to discovery.

To model your system resources manually

1. Using the Explorer tab of the OneClick navigational panel, navigate to the Universe

topology view in which you want the new device to appear. The selected Universe

topology view appears under the Topology tab of the Contents panel. Tip: If you want

to place the new device inside a network group container, double-click the container

icon to display the topology view for that container.

2. In the Topology tab toolbar area, click the Create model by IP address button. The

Create Model by IP Address dialog appears.

Note: To remove a modeled element from a view, select the element and click

Delete (X).




Tips: To move or enhance the appearance of the recently modeled device icon, click the

Edit mode button in the Topology tab toolbar. You can edit and arrange the model devices

using the following techniques:

To copy or paste the modeled device icon to another topology view other than the

Universe topology, use the copy and paste functions in the Topology tab toolbar area.

To change configuration parameters of a modeled device (for example, community name,

polling interval, logging interval, security string, and so on), select the modeled device

and change the appropriate settings in the Component Detail panel.

Unicenter NSM Agents

You can discover and model Unicenter NSM agents automatically using SPECTRUM

discovery, or you can manually model them. Because Unicenter NSM agents use UDP port

6665 for SNMP communications, by default, rather than the standard SNMP port 161,

SPECTRUM can discover and model other agents running on the host device. For example, if

a Windows workstation is running a Unicenter NSM agent bound to port 6665, as well as

the Microsoft SNMP agent bound to port 161, SPECTRUM will create two models for the

device: a Unicenter NSM System Host device model and a Windows Host device model, as

shown in the following figure.




This scenario can cause poor performance for the following reasons:

It creates unnecessary duplicate models in SPECTRUM.

It causes redundant SNMP traffic and polling which can reduce network and SPECTRUM

performance.

It reduces performance of the agent host machine because multiple management agents

are providing performance data.

To avoid this scenario, do the following:

1. Before discovering and modeling, stop and/or remove all management agents except

the one that you want to use to manage the system. By doing this, you can avoid

creating and managing multiple models in SPECTRUM for the same host. Remember to

use the correct SNMP port for the discovery.

2. If you must run more than one agent on a given host system, consider manually

modeling only the agent that you want to manage with SPECTRUM.
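
Before you run discovery, it can help to know which hosts answer SNMP on more than one port, because those are the hosts that would be modeled twice. The following Python sketch groups a list of (address, port) responses by host and reports the hosts with multiple agents. The input data is invented for illustration, and the sketch does not perform any SNMP queries itself; how you collect the responses is left open.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def hosts_with_multiple_agents(responses: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Map each host that answered on more than one UDP port to its sorted port list."""
    ports_by_host = defaultdict(set)
    for ip, port in responses:
        ports_by_host[ip].add(port)
    return {ip: sorted(ports) for ip, ports in ports_by_host.items() if len(ports) > 1}


# Hypothetical results: 161 is the standard SNMP port, 6665 the Unicenter NSM default.
responses = [("192.0.2.21", 161), ("192.0.2.21", 6665), ("192.0.2.22", 6665)]
print(hosts_with_multiple_agents(responses))
```

Each host in the result is a candidate for stopping the extra agent or for manual modeling of only the agent that you want SPECTRUM to manage.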

How You Add System Agents in eHealth

After you discover and model systems using SPECTRUM, import a SPECTRUM Global

Collection to add those system resources to eHealth for reporting and Live Health

monitoring. You could add the systems using eHealth discovery as well, but as a best

practice for the integrated solution, import the systems from SPECTRUM as described in

Chapter 5. After a few eHealth poll cycles, you can run At-a-Glance reports and Trend

reports from the OneClick interface.

Performance Reporting On System Agents

eHealth normalizes common performance data across all managed system agents

(Unicenter NSM, SystemEDGE, and third-party). Presenting all performance data in a

common and understandable format minimizes the learning curve for all users who

access real-time and historical trending reports.

At-a-Glance Reports

An eHealth At-a-Glance report for system elements provides summary capacity statistics for

the specified system including CPU, interface, and partition utilization; disk faults and I/O;

and system availability. With these reports, you can quickly isolate busy CPUs or full disks

and compare groups of systems. A sample At-a-Glance report for a system element follows.




HOW YOU RUN AT-A-GLANCE REPORTS

You can run At-a-Glance reports using one of these methods:

In SPECTRUM OneClick, right-click a device and click At-a-Glance Reports. The integrated

At-a-Glance report runs in the background and appears automatically in a web browser

on your system.

With a web browser, log in to the eHealth Web interface at the URL

http://hostname:port, where hostname is the name or IP address of your eHealth

system, and port is the HTTP port used by the web server. If your Web server uses the

default port 80, you can omit the port number. You must have an eHealth Web user

account to log in to the eHealth web interface. Navigate to the Run Reports tab and run

an At-a-Glance report on demand.

MyHealth Reports for Systems

The MyHealth report page on the eHealth Web interface contains a series of charts that are

tailored to your particular interest. MyHealth provides eHealth web users with one or more

customized reports on the elements and groups that they consider critical. A MyHealth

report page contains one or more panels, and each panel contains a separate chart.

Health Reports for Systems

A Health report contains information about the performance of a group of elements for a

report period and alerts you to situations that require your attention. The report also

identifies situations to investigate because of errors, unusual utilization rates, or excessive

volume.

You can use a Health report to do the following:

Identify normal and exceptional system behavior.

Compare the performance of a group of elements during a report period to their

performance over a baseline period.

Detect changes in behavior that indicate imminent or existing problems.

Identify trends in volume.

Identify systems that require further investigation.

Using Live Trend

You can use Live Trend to create charts that monitor statistics for elements that you are polling

using eHealth. You can create a single chart or multiple charts in various styles to represent

element trends (a single element with multiple variables) or variable trends (a single

variable for multiple elements). The following chart shows a Live Trend chart for four

variables on a system called atlanta.




START LIVE TREND

To use Live Trend, you must log in to the eHealth Web interface and download the Live

Health client application to your local PC or workstation. Install the Live Health client

following the instructions provided on the download page. You can then start the Live Trend

application to run real-time performance charts for your systems and resources.

To start the Live Trend application

1. Make sure that you have downloaded and installed the Live Health client software from

the eHealth Web interface.

2. Do one of the following to open the Live Trend application:

› If your system is a Windows system, select Start, Programs, eHealth, Live Trend.

Your program group name will vary, depending on the name that you used when

you installed the Live Health client.

› On a UNIX system, change to the Live Health client installation directory and run

the command nhLiveTrend.

3. In the eHealth System field in the Live Trend application window, specify the name of

the system to which you want to connect, and then specify your user name and

password. The Live Trend Chart Definition Manager appears.

You can create your own charts through the Live Trend Chart Definition Editor to specify the

elements and variables for which you want to view data. For more information, see the Live

Trend web help that is accessible from the eHealth Web interface.




How You Run Trend Reports for Systems

You can use Trend reports to determine the value of one or more variables for your

systems over a specified report period. This can help you to track the values of the

variables to determine when values might have changed radically or when a particular

event, such as a reboot or missed poll, occurred.

The Trend variables differ for each element type. You can run reports for the following

types of systems and system components:

CPU

Disk

LAN

Process and process set

User or system partition

WAN

Each of these types includes specific variables on which you can run reports. For example,

server disk elements have variables for disk reads and writes, storage capacity, and storage

utilization. You can select up to ten variables at a time on which to run a Trend report. For

a complete list of system Trend variables, see the eHealth web help.

The following sample Trend report shows several common system variables:

Total Bytes

Total Incoming Bytes

Total Outgoing Bytes

System Calls




RUN A TREND REPORT

You can run Trend reports from the eHealth web interface.

To create a Trend report similar to the example above

1. Use a web browser to log on to the eHealth web interface.

2. Click the Run Reports tab.

3. Scroll the Available Reports frame to the Trend reports section.

4. Click Standard.

5. Select the System Element Type.

6. Scroll the Elements list and select the target system element.

7. Scroll the Variables list and select variables; the sample report shows the four variables

Total Bytes, Total Incoming Bytes, Total Outgoing Bytes, and System Calls.




8. Select the chart type such as a stacked line chart.

9. Scroll the right frame and click More Options.

10. Select Show Summary Statistics in the General tab to show the tabular data below the

chart.

11. Click Generate Report. eHealth processes the report data and displays the Trend report.

Top N Reports

A Top N report lists all of the elements in a group that exceed or fall below the report

criteria goals that you specify. You can also specify the goal for each variable. eHealth

calculates the difference between the actual value for that variable and the goal that you

have set.
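
The calculation behind a Top N report, the difference between each element's actual value and the goal, can be sketched in a few lines of Python. This is an illustration of the arithmetic only, not eHealth's implementation. The function name, the element names other than atlanta, and the sample values are invented for the example.

```python
from typing import Dict, List, Tuple


def top_n(values: Dict[str, float], goal: float, n: int, exceed: bool = True) -> List[Tuple[str, float]]:
    """Return up to n (element, actual - goal) pairs, largest deviation from the goal first."""
    diffs = {name: actual - goal for name, actual in values.items()}
    if exceed:
        selected = [(name, d) for name, d in diffs.items() if d > 0]
        selected.sort(key=lambda item: item[1], reverse=True)
    else:
        selected = [(name, d) for name, d in diffs.items() if d < 0]
        selected.sort(key=lambda item: item[1])
    return selected[:n]


cpu_utilization = {"atlanta": 92.0, "boston": 71.0, "chicago": 85.0}
print(top_n(cpu_utilization, goal=80.0, n=2))                # elements above the goal
print(top_n(cpu_utilization, goal=80.0, n=2, exceed=False))  # elements below the goal
```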

What-If Capacity Trend Reports for Systems

The eHealth What-If Capacity Trend report enables you to perform capacity planning by

adjusting factors for capacity and demand until you have devised an appropriate what-if

solution. By giving you the capability to illustrate possible future scenarios, this report helps

you prepare for problems before they occur.




Chapter 7: Service Level Management

At the core of CA’s solution is a concept called Business Service Intelligence (BSI), a

methodology for understanding the relationships and impact of IT infrastructure on

business services. BSI delivers Technology Relationship Mapping, impact analysis, and RCA

that enables our customers to evolve their IT organizations from being tactically reactive to

strategically proactive, while improving IT service quality from a customer and business

perspective.

BSI provides adaptive analytics that communicate bi-directionally with thousands of multi-

vendor, multi-technology devices to identify, verify, and solve complex problems using

model-based, rule-based, and policy-based correlation engines. Business service definition

and on-going maintenance issues are eased through automation, while asset, availability,

capacity planning, change management, performance, and trend analysis validate SLA

compliance. BSI provides a bottom-up approach to Business Service Management (BSM)

that is practical, achievable, and delivers rapid time-to-value.

BSM provides the most obvious value when basic fault management data is insufficient

and additional correlation is required to determine the impact of a fault

and to identify the business services that may be affected. The

SPECTRUM Service Management module features the ability to organize, analyze, and

control all aspects of this area. It also provides a dashboard view as an extension of

OneClick that focuses directly on service health and hides the complexity of topology that is

normally seen using OneClick.

In general, the approach to service management can be described as “top-down” to identify

the relationships and dependencies of devices, systems, applications, or performance

measurements. Within SPECTRUM, they are referred to as resources (models or data) and

relationships. These resources and relationships are organized into service or subservice

models. You should define a service from the bottom up to permit the future reuse of

common services or subservices. You can configure SLAs to dynamically measure violations

and send alerts.

This chapter describes an approach to designing and implementing a service management

system within the SPECTRUM application. Unlike most of the functions within SPECTRUM,

preparing for service management may involve considerable planning to determine all of

the required information and implications.

The SPECTRUM methodology is designed to evolve over time. As more information becomes

available as the implementation proceeds, a more granular representation and

measurement of service modeling and management typically emerges.

Additional References:

Service Manager User Guide

Report Manager User Guide

Service Performance Manager User Guide




Interview Procedures

This section describes an approach to designing and implementing the interview process for

service management.

Interview Questions

To organize and implement service management and service level management, document

and collect responses to the following interview questions:

Which business services do you want to monitor?

Which particular resources support those services?

› Processes?

› Software applications?

› IT devices?

How can conditions and faults that affect services be detected?

Which resource attributes should be monitored to determine the health of a service?

Who should be notified if a given service fails?

What are the SLAs, and how should they be quantified (metrics)?

What is the criticality of a given service relative to other services or subservices?

General Questions

To organize and implement service management and service level management, document

and collect responses to the following general questions:

Which WAN and LAN technologies support the service?

Are QoS classes of service (CoS) currently set up?

Are the MPLS-based VPNs that are currently in use being monitored?

Are all of the critical network devices and servers manageable and being managed?

Are all elements being properly discovered and mapped down to layer 2 or layer 3 as

appropriate?

Are thresholds configured on your critical interfaces?

› Error rate?

› Discard rate?

› Load, etc?

Do any environmental monitors need to be monitored (temperature or humidity)?

Do any power systems or battery backup systems need to be monitored?

Are the critical log files or Windows event logs being monitored?

Are the critical processes or Windows services being monitored?




Are application ports being tested?

› File Transfer Protocol (FTP)?

› Hypertext Transfer Protocol (HTTP)?

› Domain Name System (DNS)?

For more advanced needs, are custom thresholds/alarms configured?

Can any model attributes be used to determine the health of a resource?

Have unique alarms been created using event or condition correlation?

Are any existing or custom integrations enabled with alarm data that can be used?

Of the IT resources listed, who is responsible for the proper operation of each?

Does each individual have access to the correct tools, and do they have their contact

information (email, phone, etc) available for distribution?

Does each of the resources have a corresponding troubleshooter, or are troubleshooters

added for proper notification?

Which users benefit from the IT resources listed?

Rate each user relative to each other:

› Low

› Medium-low

› Medium

› Medium-high

› High

Do logical groups of users exist for ordering purposes?

› Department

› Function

› Role, etc

Is the device criticality defined and/or measured for all of the network devices and

servers?

Analysis and Mapping Procedures

This section describes an approach to analyzing and mapping the results of the interview

process for service management.

How You Organize the Resource Information

Follow these best practices to organize the information that you collected for your services.

Sort the information by common information types such as the following:

› Application names

› Server names

› Device names and/or types

› Metrics measurement and sources of that data

Identify logical groupings of these common resources to avoid duplication.




How You Illustrate the Relationships of Resources to Each Other

Create a diagram that shows how the resources relate to each other. The diagram can help

you to map the way in which resources depend on or impact each other.

How You Decompose the Information and Map It to Service Models

The information that you gather and prepare can help you to build the service models that

you need to monitor with SPECTRUM.

Take a bottom-up approach to create the most common resource models first.

Create a service model by creating a relationship to the proper subservice models.

Add service-specific resources not available via subservices.

Example of a Business Service Map to Service Models

Customer ABC has identified a critical business process. When their clients place phone

orders, operators enter these orders into a web-based order processing system. These

orders are stored and processed from an Oracle database. Because many problems can

occur throughout this process, customer ABC wants to build a service that will indicate

when the order processing is adversely impacted. In the interview process, some of the

critical items were identified and then grouped in the following hierarchy:

Web server (WEBORDER1)

› Dell hardware, running SNMP agent (RFC 2790 or equivalent)

› Microsoft Internet Information Services (IIS) web server

› Log file with critical data flow entries

› CPU and memory need to be monitored

› APC uninterruptible power supply (UPS) battery backup

› Proper response from web server required

Oracle Database Server (WEBDB1)

› Dell hardware, running SNMP agent (RFC 2790 or equivalent)

› Oracle Database with Oracle Intelligent Agent

› CPU and memory need to be monitored

› APC UPS battery backup

Cisco 6509 Catalyst switches

› DATASW1 responsible for Server connections

› DATASW2 responsible for Operator Workstation connections

25 Operator Workstations

› DNS service monitoring is required

› Dynamic Host Configuration Protocol (DHCP) service must function




By posing more questions such as the following, you can discover the criticality of items by

possible faults:

What are the most catastrophic failures that could occur?

Is it possible to measure all of the items chosen?

What would be the criticality of each item relative to every other item?

Can any items within the list be reused, or are any necessary as a generic service to

other IT business processes?

What are the processes by which you want to manage problems?

What would be more critical, losing 25% of the workstations, or losing the switch used to

connect the servers?

Start by grouping the most critical assets and the most critical outages. Most certainly, the

loss of the servers or switches would be the most catastrophic failure to occur, so begin by

grouping items as follows:

SERVICE: Web Order Processing

› Components – WEBORDER1, WEBDB1, DATASW1, DATASW2

If any component is down, service is down.

Ports on switches with server connections

If either port is down, service is down.

› Components – operator workstations

If 75% of the workstations are down, service is down.

If 50% of the workstations are down, service is degraded.

If 25% of the workstations are down, service is slightly degraded.

› Performance – web response time, TCP port for Oracle

If both are critical, service is down.

If one or the other is critical, service is degraded.

If one or the other is violated, service is slightly degraded.

› Alarm condition of all four resources

General criticality for alarm conditions (minor, major, or critical)
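The workstation thresholds in this grouping can be read as an ordered check from most to least severe. The following Python sketch is illustrative only: the function name and the lowercase health labels are assumptions, not SPECTRUM identifiers, and it reads each threshold as "at least this percentage", which is one possible interpretation of the text.

```python
def workstation_health(down_count, total=25):
    """Map the share of failed operator workstations to a service health label."""
    percent_down = 100.0 * down_count / total
    # Check the most severe threshold first so that it takes precedence.
    if percent_down >= 75:
        return "down"
    if percent_down >= 50:
        return "degraded"
    if percent_down >= 25:
        return "slightly degraded"
    return "up"
```

With 25 workstations, for example, 7 failed workstations (28%) would be reported as slightly degraded under this reading, while 6 (24%) would not affect the service health.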

It is also necessary to determine how users are affected when these business services are

impacted. To put it simply: who is affected when your business service is impacted, and

how critical is that person? A customer who cannot access your sales website will be very

inconvenienced; therefore, that customer is very likely a critical (very important) user.

If your internal users cannot access an internal web server that is not very important for

their day-to-day tasks, assign a much lower criticality to that problem. Answer the following

questions to help ascertain the impact of your business process:

Of the server users listed, can you sort the list of users by relative importance?

Once listed, can you organize these users or customers by company, organization,

department, or role?




You also need to consider the network services. Although more general, network services

such as the following do affect the business service: DNS, DHCP, and e-mail. You can set

up response time tests and use a service to monitor the servers providing the service;

however, you should treat them slightly differently. Since these services may be a common

dependency for other services, you should create them with reuse and modularity in mind.

An example of a DNS and DHCP service might look like this:

SUBSERVICE: DNS

› Components – DNS servers SERVER-DNS1 and SERVER-DNS2

If both servers are down, the service is down.

If one server is down, service is slightly degraded.

› Response time tests – test DNS response time

If response time is violated, service is slightly degraded.

› Alarm condition of both resources

General criticality for alarm condition (minor, major, or critical)

Creating Service Models and Relationships

This section introduces service modeling concepts and techniques. Before creating service

models, you should gain an understanding of a few key concepts.

Key Concepts

Resource Monitoring Every service model is a resource monitor that actively monitors

its resources to determine its own service health. Service resources are SPECTRUM

models, and virtually any model could be a service resource. Service resources might

consist of device models, interface models, SPM tests, process models, and even other

service models. To monitor a resource, the service watches specific attributes of the

resource model. A service model can monitor any attribute whose values are whole

numbers. This behavior of a service watching the attribute values of its resources is called

resource monitoring.

Service Health Service health is represented by a small set of values: up, down,

degraded, and slightly degraded. Each resource monitor determines its own service

health based on attribute values from its resources. Specifically, a service health policy is

applied to the collective attribute values from all resources. A policy is essentially a

formula which calculates a service health value based on one or more resource attribute

values. The logic applied by the policy is encapsulated into a set of policy rules. Each rule

is a statement which, when evaluated, will be labeled as true or false. When a policy is

evaluated, the first rule that is found to be true, or the first rule satisfied, determines the

service health taken on by the service or resource monitor.
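The "first rule satisfied wins" evaluation described here can be sketched in a few lines of Python. This is a conceptual illustration, not SPECTRUM code: the function name, the representation of a rule as a (condition, health) pair, and the default value of "up" when no rule is satisfied are all assumptions.

```python
def evaluate_policy(rules, values, default="up"):
    """Return the health of the first rule whose condition is true."""
    for condition, health in rules:
        if condition(values):
            return health
    # No rule was satisfied, so fall back to the assumed default.
    return default

# Hypothetical rule set: each rule pairs a condition with a resulting health.
# Order matters, because evaluation stops at the first satisfied rule.
example_rules = [
    (lambda v: all(x == "down" for x in v), "down"),
    (lambda v: any(x == "down" for x in v), "degraded"),
]
```

Because evaluation stops at the first match, the more severe rule is listed first; if the two rules were swapped, the "all down" case would be reported as degraded.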

Root Cause and Service Impact Considering that a service determines its own health

by monitoring its resources, a logical relationship exists between resource outages and

service health. This relationship is expressed in terms of root cause and service impact.

When a resource outage results in a change in the health of a service, that outage is the

root cause of the service health change. Likewise, when a resource outage affects service

health, the outage has a service impact. These concepts become very important for users

who must address service outages.




Hierarchical Service Modeling As mentioned above, each service is a resource

monitor that determines its own health by applying a policy to a set of attribute values

from its resources. It is important to note that a service can monitor resources that are

actually other services. This allows for the creation of service hierarchies; thus, a user

can build services with components of other services. This allows for service modeling to

extend from very low-level fundamental services to high-level conceptual service models.
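The idea that a service can itself be a resource of another service can be pictured as a small recursive structure. The Python sketch below is an assumption-laden illustration: the class names, the model names, and the worst_of policy are hypothetical and do not correspond to SPECTRUM model types or shipped policies.

```python
class Resource:
    """Leaf resource with a directly observed health value."""

    def __init__(self, name, health="up"):
        self.name = name
        self.current = health

    def health(self):
        return self.current


class Service:
    """A service derives its health by applying a policy to its resources."""

    def __init__(self, name, resources, policy):
        self.name = name
        self.resources = resources
        self.policy = policy

    def health(self):
        # Resources may be Resource or Service objects; both expose health().
        return self.policy([r.health() for r in self.resources])


def worst_of(values):
    """Hypothetical policy: report the most severe health among the resources."""
    order = ["up", "slightly degraded", "degraded", "down"]
    return max(values, key=order.index)


low = Service("low-level", [Resource("device-a"), Resource("device-b", "degraded")], worst_of)
high = Service("high-level", [low, Resource("device-c")], worst_of)
```

Here the high-level service never inspects device-b directly; it only sees the health that the low-level service reports.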

How You Create Service Models

The process of creating a service model is composed of two primary steps:

1. Select resources.

2. Select the policy that monitors the resources.

The following examples show how a user might create service models representing a web-

based service. For more information about creating service models, see the Service

Manager User Guide.

Example 1: A Customer Account Access Service

Determining the resources of a particular service can seem like a daunting task. In many

cases, it is not possible for you to consider all possible components of a service, and then

map how each component might impact a given service. One distinct advantage of

SPECTRUM Service Manager is that you can start with small, simple models, and continually

refine your service modeling as you gain a better understanding of the service components

and how each one impacts the overall service.

Although understanding all components of a service is difficult, it is usually easy to identify

some of the most critical components, which provides an effective starting point. For

example, consider a simple web service used by a phone support organization to access

customer account data. This service will be referred to as the Customer Account Access

Service.

With just this basic, general description, you can begin to identify some of the service

components. As this is a web-based service, it must be supported by one or more web

servers. In addition, the service is providing access to information from a database of

customer accounts. This implies that the database is likely hosted on one or more systems.

For this example, consider an environment with two web servers, and two database

servers. This provides a starting point for modeling the service. If both web servers, or both

database servers, are down, the entire service will not work; as long as one web server and

one database server is up, the service will run, even though it will likely experience some

degradation. This very simple description provides the basis for creating the Customer

Account Access Service.

To begin device modeling in SPECTRUM, consider each web server and database host to

be resource models. Monitoring the contact status of these device models will determine

if the systems are up. As mentioned in the Key Concepts section, each service is a

resource monitor.

SPECTRUM offers a basic formula for service health which provides a general

understanding of the availability of these four service resources. The following table

presents a matrix containing each component and how its status (up/down) would affect

the service relative to the status of the other resources.




Service Health Matrix Table

Web Server 1 Web Server 2 DB Server 1 DB Server 2 Service

ESTABLISHED ESTABLISHED ESTABLISHED ESTABLISHED UP

LOST ESTABLISHED ESTABLISHED ESTABLISHED SLIGHTLY

DEGRADED

ESTABLISHED LOST ESTABLISHED ESTABLISHED SLIGHTLY

DEGRADED

ESTABLISHED ESTABLISHED LOST ESTABLISHED SLIGHTLY

DEGRADED

ESTABLISHED ESTABLISHED ESTABLISHED LOST SLIGHTLY

DEGRADED

LOST ESTABLISHED LOST ESTABLISHED DEGRADED

LOST ESTABLISHED ESTABLISHED LOST DEGRADED

ESTABLISHED LOST LOST ESTABLISHED DEGRADED

ESTABLISHED LOST ESTABLISHED LOST DEGRADED

LOST LOST ESTABLISHED ESTABLISHED DOWN

LOST ESTABLISHED LOST LOST DOWN

ESTABLISHED LOST LOST LOST DOWN

ESTABLISHED ESTABLISHED LOST LOST DOWN

LOST LOST LOST LOST DOWN

This table indicates that if both web servers or both database servers are down, the service

is down. If one web server is down and one database server is down, the service is

degraded. If any one server is down, the service is slightly degraded. This is a very

simplified approach, but it demonstrates a good starting point.
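The three statements above can be checked mechanically against the matrix. The following Python sketch is an illustration under stated assumptions: the function name and the use of plain strings for contact status are hypothetical, and the code models only the rules as described in the text, not SPECTRUM behavior.

```python
from itertools import product


def service_health(web1, web2, db1, db2):
    """Classify service health from four contact status values.

    Each argument is "ESTABLISHED" or "LOST".
    """
    web_lost = [web1, web2].count("LOST")
    db_lost = [db1, db2].count("LOST")
    if web_lost == 2 or db_lost == 2:
        return "DOWN"
    if web_lost == 1 and db_lost == 1:
        return "DEGRADED"
    if web_lost + db_lost == 1:
        return "SLIGHTLY DEGRADED"
    return "UP"


# Enumerate all 16 combinations of the four contact status values.
matrix = {
    combo: service_health(*combo)
    for combo in product(["ESTABLISHED", "LOST"], repeat=4)
}
```

The enumeration covers all 16 combinations, including two that the table does not list (both web servers and exactly one database server lost); under the stated rules both evaluate to DOWN.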

From this, you can consider how to monitor each component. The table shows that

particular combinations of status values result in specific levels of service degradation.

Essentially, you can classify the resources into web server components and database

components, and think of the grouped resources as services within a service. To enable the

Customer Account Access Service to function, the web server components and database

server components must be functioning. Within the Customer Account Access Service,

small, more discrete, subservices may exist.




Considering that each service is also a resource monitor, this example is a good case for

creating resource monitors within the Customer Account Access Service, as shown below.

Resource monitors allow you to organize resources, and monitor them based on specific

criteria with knowledge of how it will impact the service. The resource monitor becomes an

abstraction of multiple resources, and reports a health value based on the collective status

of the resources it monitors. That knowledge is the basis for a service resource monitoring

policy.

In this example, it was established that the contact status of each device model can be

monitored to determine its availability. In addition, the table indicated how different

combinations of contact status values impact the service. Looking first at the web servers, a

policy can be produced which will adequately report the status of the web server

components as a whole. These statements can be called the web servers redundancy

policy.

WEB SERVERS REDUNDANCY POLICY

When the contact status of all web servers is lost, the web server component of the

service is down.

When the contact status of any one web server is lost, the web server component of the

service is degraded.
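Expressed as code, the two redundancy statements become an ordered pair of checks. This Python sketch is only a conceptual model; the function name and the string values are assumptions, and an empty resource list is treated as UP by choice.

```python
def redundancy_health(contact_statuses):
    """Apply the redundancy rules in order; the first satisfied rule wins."""
    lost = [status == "LOST" for status in contact_statuses]
    # Rule 1: contact with every server is lost.
    if lost and all(lost):
        return "DOWN"
    # Rule 2: contact with at least one server is lost.
    if any(lost):
        return "DEGRADED"
    return "UP"
```

The same function would serve for a group of any size, which is why the policy can be reused for the database servers.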

The web components and database components are described as services within a service.

That concept is important when dealing with groups of resources that support a specific

aspect of a service. In all cases, if both web server machines are down, the Customer

Account Access Service is down. However, just one web server down does not necessarily

indicate that the Customer Account Access Service is down, or even degraded. You can

think of the web servers, collectively, as a component of the service in that when one of

those servers is down, that component of the service is degraded. This might not be

completely clear yet, but as the service model evolves, it will become apparent why this

approach should be taken.

The impact of loss of contact with the database servers mirrors that of the web servers.

These statements can be referred to as the database servers redundancy policy.

DATABASE SERVERS REDUNDANCY POLICY

When the contact status of all database servers is lost, the database component of the

service is down.

When the contact status of any one database server is lost, the database component of

the service is degraded.

The web servers and database servers have been described collectively as a web server

component and a database component. Consider how the web server and database server

components impact the Customer Account Access Service.




If both web servers — or both database servers — are down, the service is down. These

were organized into two groups: a web server component and a database component.

These can be labeled as the Web Servers resource monitor and the Database Servers

resource monitor. In review, each resource monitor determines its own health value, based

on the resources that it is monitoring. The Web Servers resource monitor determines its

health based on the contact status of both web server 1 and web server 2. The Database

Servers resource monitor determines its health based on the contact status of database

server 1 and database server 2.

Encapsulating the web server and database server systems into resource monitors

considers these statements, which will be called the Standard Account Access Policy.

STANDARD ACCOUNT ACCESS POLICY

When any resource monitor is down, the Customer Account Access Service is down.

When all resource monitors are degraded, the Customer Account Access Service is

degraded.

When any one resource monitor is degraded, the Customer Account Access Service is

slightly degraded.

Although redundancy exists within each resource monitor, if either resource monitor is

down, the overall service is down. Looking back at the table of Contact Status and Service

Health values, this design can be validated. You can use the following three scenarios to

test the design.

DESIGN TEST SCENARIOS

Web Server 1 is down. This will cause the web server resource monitor to become

degraded; the database servers are not affected, so the database servers resource

monitor is up. To apply the rules defined in the account access policy:

› The first rule is not satisfied because neither of the resource monitors is down.

› The second rule is not satisfied because the Database Server resource monitor is

up, and not degraded.

› The third rule, however, is satisfied, because the Web Server resource monitor is

degraded.

In this scenario, the Customer Account Access Service will report slightly degraded.

Looking back at the matrix, when web server 1 was down and all other devices were up,

the overall service health should be considered slightly degraded. The design works for

this scenario.

Web Server 1 is down and Database Server 1 is down. Based on the implementation

described above, this would result in both the web server’s resource monitor and the

database server’s resource monitor becoming degraded. By evaluating the account access

policy, the second rule is satisfied and the Customer Account Access Service will be

degraded. By reviewing the matrix, when web server 1 and database server 1 are both

down, the overall service health should be degraded, so, again, this design works

correctly.

Database Server 1 and Database Server 2 are down. If this was the case, the database

server’s resource monitor would be down. In review of the Account Access Policy, the first

rule is satisfied; thus, it produces a result of down. By reviewing the matrix, when both

database servers 1 and 2 are down, the overall health of the service should be down.
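The three scenarios can be replayed end to end with a small model of the two-tier design. The Python sketch below is an illustration only: the function names, the scenario labels, and the string values are assumptions, and the rule order follows the Standard Account Access Policy as stated in the text.

```python
def monitor_health(contact_statuses):
    """Redundancy policy for one resource monitor."""
    lost = [status == "LOST" for status in contact_statuses]
    if all(lost):
        return "DOWN"
    if any(lost):
        return "DEGRADED"
    return "UP"


def account_access_health(monitor_healths):
    """Standard Account Access Policy; rules are evaluated in order."""
    if any(h == "DOWN" for h in monitor_healths):
        return "DOWN"
    if all(h == "DEGRADED" for h in monitor_healths):
        return "DEGRADED"
    if any(h == "DEGRADED" for h in monitor_healths):
        return "SLIGHTLY DEGRADED"
    return "UP"


def scenario(web, db):
    """Evaluate both resource monitors, then the top-level service."""
    return account_access_health([monitor_health(web), monitor_health(db)])


E, L = "ESTABLISHED", "LOST"
results = {
    "scenario 1": scenario([L, E], [E, E]),  # Web Server 1 down
    "scenario 2": scenario([L, E], [L, E]),  # Web Server 1 and Database Server 1 down
    "scenario 3": scenario([E, E], [L, L]),  # both database servers down
}
```

Note that the top-level function receives only the two monitor health values, never the individual contact statuses.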




Although it is a very simple example, this process has identified the resources of a service

and how to monitor them. Despite its simplicity, this implementation provides the

knowledge to correctly report the health of the Customer Account Access Service for

thirteen different fault scenarios involving the systems which host the web servers and

database server applications. Obviously, this implementation is not yet very robust because

it is only monitoring the four systems as up or down. Before extending the Customer

Account Access Service, review the following steps to implement this design using

SPECTRUM Service Manager.

IMPLEMENT EXAMPLE 1 IN SPECTRUM

You create Service Models in SPECTRUM using the Service Editor, which you launch from

the Tools, Utilities menu of the OneClick Console.

To start building the service model

1. In the OneClick Console, select Tools, Utilities, Service Editor.

2. Click Create.

3. Specify the service name Web Server Contact Monitor, and a description and security

string.

4. Click the Locate resources and containers button (binoculars).

5. In the left pane of the Locate Resources dialog, click Devices, Devices, By Model Name

(or By IP Address). Launch the selected search (binoculars). Specify search criteria

(leading and trailing wildcards are implicit for model name) and click OK.

6. In the right pane, select all server models that you would like to associate with this

service model.

7. Click Add Selected to Monitored Resources.

8. Click Close.

9. Click Select to display the Select Policy dialog. The resource monitor will use the Web

Servers Redundancy Policy described previously in this chapter.

10. In the left pane, select Contact Status as the Value Map.

11. Click New in the Rule Set, and name the rule set Web Server Redundancy Rules.

12. Click Add to create the first rule: Rule Type All, When all are Down, the service is

Down.

13. Click OK.

14. Click Add to create the second rule: Rule Type Any, When any 1 are Down, the service

is Degraded.

15. Click OK.

16. Click Create in the Create Rule Set dialog.

17. Click OK.




18. Click Create in the Create Service dialog.

19. Repeat Steps 2 through 10 to start creating the Database Server Contact Monitor.

Note: This policy will be identical to the Web Server Redundancy Rules.

20. In the right pane, select Web Server Redundancy Rules and click Copy.

21. Name the new rule set Database Server Redundancy Rules.

22. Click Create.

23. Click OK to close the Select Policy dialog.


25. Click Create to start creating the top level service (Customer Account Access Service)

of the hierarchical structure.

26. Specify the service name Customer Account Access Service, and, optionally, a

description and security string.

27. Click the binoculars, and then click Locater, Services, Services, All. Launch the selected

search (binoculars).

28. Select the Landscape, if it appears.

29. Select the Web Server Contact Monitor and Database Server Contact Monitor Services.

30. Click Add Selected to Monitored Resources.

31. Click Close.

32. Click Select to display the Select Policy dialog.

33. In the left pane, select Service Health as the Value Map.

34. Click New in the Rule Set, and name the rule set Standard Account Access Policy, which

is based on the following set of rules:

› Rule Type Any: When any 1 are Down, the service is Down.

› Rule Type All: When all are Degraded, the service is Degraded.

› Rule Type Any: When any 1 are Degraded, the service is Slightly Degraded.

35. Click Create in the Rule Set dialog.

36. Click OK.


38. Close the window.




REVIEW EXAMPLE 1

The design for Example 1 includes one Service monitoring two Resource Monitors. This is a

two-tiered approach in which each Resource Monitor consolidates the status of its own

resources, and then reports the result as its service health. The Customer Account Access

Service then determines its own service health, based on the collective service health of the

two resource monitors. This pattern encompasses an important abstraction that is essential

to understanding service management.

Each service and resource monitor performs two tasks:

Monitors those “resources” to which it is related.

Determines its own service health by applying values from those resources to a policy.

Consider these questions regarding the implementation of Example 1:

Does the Customer Account Access Service have any knowledge of database server 2?

The three test scenarios do not mention the Customer Account Access Service monitoring

database server 2; however, in scenario 3, when both database servers 1 and 2 are down,

the Customer Account Access Service correctly determined that its service health should be

down.

How did it work?

The Database Servers Resource Monitor determined that its own health was down. The

Customer Account Access Service, which monitors the Web Servers Resource Monitor and

the Database Servers Resource Monitor, determined that it, too, should be down. When

evaluating its Account Access Policy, it found that one of the resource monitors was down

and, therefore, its own health should be down. Database servers 1 and 2 are “resources” of

the Database Servers Resource Monitor, and the Database Servers Resource Monitor is a “resource”

of the Customer Account Access Service. Each component determines its own health based

on its resources.

Example 2: Extend the Service to Monitor Critical Processes

Example 1 describes how to design and implement a very simple service using two resource

monitors. Although this is a legitimate service, it is not a very complete one. In revisiting

the Customer Account Access Service, you could expand the monitoring of service

components in several ways. So far, only the Contact Status of those devices hosting the

web servers and database servers has been incorporated into the service. Device

availability alone does not ensure that you will be able to obtain customer account

information.

You need to also consider that a web server is an application that supports web

transactions. This application must be running in order for customer account access

requests to be processed. Considering the criticality of these web server systems, it is

logical that they will also host an agent that supports process monitoring or a Host Resources

MIB such as the one defined by RFC 2790. This allows a user to actually monitor the web server

process itself.

You can use a process model to determine if a particular process is actually running on a

device. Considering that the web server system might be up, but the web server application

might not be running, additional monitoring of the web server application processes is

important to correctly determine the overall health of the Customer Account Access Service.




At first, it might appear simple enough to just add another resource monitor to watch the

Condition of the web server process model and treat the availability of each process

redundantly, in the same way as the device availability is monitored. Consider the following

table, which shows a breakdown of potential fault scenarios and how each combination

affects the availability of the web servers in terms of being able to process a request. This

table demonstrates what is often called a “high-sensitivity policy.”

Service Health Matrix – Servers and Processes

Web Server 1   Web Server 2   Process 1   Process 2   Web Service Health
ESTABLISHED    ESTABLISHED    NORMAL      NORMAL      UP
ESTABLISHED    ESTABLISHED    CRITICAL    NORMAL      DEGRADED
ESTABLISHED    ESTABLISHED    NORMAL      CRITICAL    DEGRADED
ESTABLISHED    ESTABLISHED    CRITICAL    CRITICAL    DOWN
LOST           ESTABLISHED    CRITICAL    NORMAL      DEGRADED
LOST           ESTABLISHED    CRITICAL    CRITICAL    DOWN
LOST           LOST           CRITICAL    CRITICAL    DOWN

The following table replaces the individual devices and processes with the resource

monitors that could be used to monitor them.

Service Health Matrix – Devices and Processes

SERVER DEVICES SERVER PROCESSES WEB SERVICE

UP UP UP

UP DEGRADED DEGRADED

UP DOWN DOWN

DEGRADED DEGRADED DEGRADED

DEGRADED DOWN DOWN

DOWN DOWN DOWN

DEGRADED UP DEGRADED

DOWN UP DOWN

DOWN DEGRADED DOWN

Note: The three rows at the end of this table typically would not happen since a system

that is reported as down should not report that it has running processes. However, the

rules should handle these situations to avoid the possibility of getting into unknown states.




Looking at the devices and processes collectively, the table indicates that this is not a case

for a redundancy policy, which appeared to be the first choice when evaluating the

resources. As found in the case of the overall service, the relationship between web service

hosts and the web server processes implies a high-sensitivity rule set similar to this one.

When any resource is down, the service is down.

When any resource is degraded, the service is degraded.
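The two high-sensitivity rules can be checked against every row of the Devices and Processes matrix. The Python sketch below is a conceptual illustration; the function name and the tuple layout of the rows are assumptions, and the row values are copied from the table in this section.

```python
def high_sensitivity(healths):
    """Any resource down gives down; any resource degraded gives degraded."""
    if "DOWN" in healths:
        return "DOWN"
    if "DEGRADED" in healths:
        return "DEGRADED"
    return "UP"


# Rows of the matrix: (server devices, server processes, web service).
rows = [
    ("UP", "UP", "UP"),
    ("UP", "DEGRADED", "DEGRADED"),
    ("UP", "DOWN", "DOWN"),
    ("DEGRADED", "DEGRADED", "DEGRADED"),
    ("DEGRADED", "DOWN", "DOWN"),
    ("DOWN", "DOWN", "DOWN"),
    ("DEGRADED", "UP", "DEGRADED"),
    ("DOWN", "UP", "DOWN"),
    ("DOWN", "DEGRADED", "DOWN"),
]

# Collect any row whose expected result differs from the rule output.
mismatches = [row for row in rows if high_sensitivity(row[:2]) != row[2]]
```

An empty mismatches list indicates that the two rules reproduce the whole table, including the three unlikely rows mentioned in the note.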

After evaluating this relationship between the server devices and processes, it would seem

that you cannot easily extend the design in Example 1 to include supplemental monitoring

such as including the new process resource models. Because the initial design tried to

encompass a high level service with multiple components, it did not recognize that there

are subservices within the Customer Account Access Service. After extending the

monitoring to the process level, it becomes apparent that a web subservice and a database

subservice do exist. Much like the web servers, you can monitor the database service host

application using process models. A hierarchy is beginning to appear as the resource

monitoring is extended to the process level.

Service Hierarchy

CUSTOMER ACCOUNT ACCESS SERVICE

› WEB SERVICE: DEVICES, PROCESSES

› DATABASE SERVICE: DEVICES, PROCESSES

It is very typical to discover lower level services which at first did not appear to be

significant enough to warrant a service model. In general, the service modeling process is

an iterative process. Each revision adds additional precision and extends the total number

of fault scenarios that can be correctly reported.

This iterative approach can be summarized in different ways. One way is to consider that

the goal of each revision is to enrich the root cause information which will be available in

the event of a service fault. Looking back to Example 1, if both web server devices were

available, but one web server process was down, the service would not have reported a

fault although service users would have experienced some performance degradation. By

extending the monitoring to the process level, the service would now report the

degradation and the process failure as the root cause. The next section shows how Example

2 can be implemented in SPECTRUM.

Implement Example 2 in SPECTRUM

The design for Example 2 includes the creation of four process models. Two of these

process models will monitor the web server application and the other two will monitor the

database server application. It is likely that a user may identify additional processes which

impact the availability of a particular service component. This approach can be extended to

include those processes as well.




To create the process models, you should locate the host model representing the server

machine on which the process is running. In the example, this would be a web server or

database server device model. If the agent on the device supports RFC 2790, you can

create process models for each process that you want to monitor.

To create a process model for each process

1. In the SPECTRUM OneClick console, list the host or hosts in the OneClick Contents

panel.

2. Select the host for which you want to create a monitoring rule.

3. Expand the System Resources section within the OneClick Component Detail view. A

subsection named Running and Monitored Processes appears.

4. Expand the Running and Monitored Processes view to show a section for Running

Processes, which, in turn, reveals a table of processes.

Note: If the text (RFC 2790) does not appear in the section names, the agent does not

support the RFC 2790 extension to MIB-II. You will not be able to monitor processes on

that host and raise alarms when processes start or stop.

a. Right-click a process in the table and select Monitor this process. The Add

Monitored Process dialog appears.

b. Select Alarm on Stop and click OK. Using this setting, the process model will

experience a critical alarm if the corresponding process is stopped. The process

appears in the Monitored Processes view.




5. After creating the appropriate process models, launch the Service Editor by selecting

Tools, Utilities, Service Editor, or by right-clicking a process and selecting Utilities,

Service Editor. The goal is to modify the Service that was created in example one to

handle this more complex situation by using similar steps as are outlined for example

one, but now with a deeper hierarchy, adding a new middle layer as well as adding

logic for the processes.

› Using the Condition value map and Redundancy rule set policy, create the service

Web Servers Redundancy Monitor, which watches the web server process

models.

› Using the Service Health High Sensitivity policy, create the Web Service, which

watches the Web Servers Contact Monitor and the Web Servers Redundancy

Monitor. This will require the reparenting of the Web Servers Contact Monitor

from the Customer Account Access Service to this service.

› Duplicate these tasks for the Database Server Redundancy Monitor and the

Database Service.

The Customer Account Access Service will now monitor the Web Service and Database

Service with the Standard Account Access Policy described in the implementation of

Example 1.

REVIEW EXAMPLE 2

Example 2 expanded the monitoring of the Customer Account Access Service to include

monitoring the actual web server process. This example also reveals that two distinct

subservices exist within the Customer Account Access Service. Each of these subservices

consists of multiple resources which are monitored in different ways, as shown in the

Service Editor Hierarchy view below.

The table below displays the ever-increasing set of fault scenarios which can be supported

by the existing service modeling.




Service Health Matrix Fault Scenarios

Legend:

WSD Web server device

WSP Web server process

DBD Database device

DBP Database process

CAAS Customer Account Access Service

DG Degraded service health

SD Slightly degraded service health

DN Down service health

WSD1 WSD2 WSP1 WSP2 DBD1 DBD2 DBP1 DBP2 CAAS

UP UP UP UP UP UP UP UP UP

UP DN UP DN UP UP UP UP SD

DN UP DN UP UP UP UP UP SD

UP UP DN UP UP UP UP UP SD

UP UP UP DN UP UP UP UP SD

UP UP UP UP DN UP DN UP SD

UP UP UP UP UP DN UP DN SD

UP UP UP UP UP UP DN UP SD

UP UP UP UP UP UP UP DN SD

DN UP DN UP DN UP DN UP DG

DN UP DN UP UP DN UP DN DG

UP DN UP DN DN UP DN UP DG

UP DN UP DN UP DN UP DN DG

UP UP DN UP UP UP DN UP DG

UP UP DN UP UP UP UP DN DG

UP UP UP DN UP UP DN UP DG

UP UP UP DN UP UP UP DN DG

DN DN DN DN UP UP UP UP DN

DN DN DN DN DN UP DN UP DN





DN DN DN DN UP DN UP DN DN

UP UP DN DN UP UP UP UP DN

UP UP DN DN DN UP DN UP DN

UP UP DN DN UP DN UP DN DN

UP UP DN DN UP UP DN UP DN

UP UP DN DN UP UP DN DN DN

DN DN DN DN DN DN DN DN DN

The table indicates 25 different fault scenarios that can be reported with the

implementation of Example 2. Note the scenario in which all devices are up but all four

processes are down (UP UP DN DN UP UP DN DN). In this scenario, all critical processes have

failed. The service is down, but it would not have been reported as down by the

implementation of Example 1.
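The matrix can be reproduced with a small model. The following Python sketch assumes one set of roll-up rules that is consistent with the rows shown above. The function names and the rules themselves are illustrative assumptions, not the actual SPECTRUM policy definitions.

```python
# Health levels ordered from best to worst.
UP, SD, DG, DN = 0, 1, 2, 3
NAMES = {UP: "UP", SD: "SD", DG: "DG", DN: "DN"}


def redundant_pair(first_up, second_up):
    """Assumed rule for a two-member monitor: one member down is slightly
    degraded, both members down is down."""
    down = [first_up, second_up].count(False)
    return (UP, SD, DN)[down]


def subservice(devices, processes):
    """Assumed high-sensitivity roll-up: the worst monitor health wins."""
    return max(redundant_pair(*devices), redundant_pair(*processes))


def account_access(web, database):
    """Assumed top-level roll-up that matches the rows in the matrix."""
    if DN in (web, database):
        return DN
    if web == SD and database == SD:
        return DG
    return max(web, database)


def caas_health(row):
    """row is 'WSD1 WSD2 WSP1 WSP2 DBD1 DBD2 DBP1 DBP2' using UP or DN."""
    s = [value == "UP" for value in row.split()]
    web = subservice(s[0:2], s[2:4])
    database = subservice(s[4:6], s[6:8])
    return NAMES[account_access(web, database)]


print(caas_health("UP UP DN DN UP UP DN DN"))  # prints DN
```

With all devices up and all processes down, the sketch reports DN, which is the scenario that Example 1 could not detect.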

Example 3: Extend the Service to Include a Response Time

Element

Example 2 enhanced the Customer Account Access service by extending visibility to the

process level. In some situations, the devices may be up and the processes are running,

but the service is not performing optimally. It is often useful to include some level of

performance monitoring as a resource of service components. This is particularly important

when the service health is intended to reflect what an end user is experiencing when using

a service.

In this example, you add a response time element to the Web Service component of the

Customer Account Access Service. Adding the performance element will not only enhance

the service monitoring, it will also test the modularity of the design produced in Example 2.

One goal of service design should be to produce services that you can easily enhance as

you gain more insight into how each service resource can be monitored.

Adding the response time component involves creating Response Time Test models in

SPECTRUM. Many devices and system agents are capable of supporting response time

tests. Since this example is intended to enhance the monitoring of the Web Service

component, you will be creating HTTP response time tests.

The number of tests can vary based on your design. It is generally a good idea to build at

least one HTTP request to each web server. For example, you could select two SPM test

hosts and create two HTTP tests on each. Each test host should issue requests to each web

server. This would provide multiple request points to each individual server. The four new

response time tests will collectively comprise a new set of resources within the Web

Service.

You can take two typical approaches to monitoring response time tests:

Monitor the latest error status of each response time test model.

Monitor the aggregate result values of each test model.




The second approach is discussed in more detail later. For this example, you will monitor

the latest error status of each test model. The following table maps Latest Error Status

(Response Time) values to equivalent service health values. This process is used

extensively by the SPECTRUM Service Manager. The goal is to normalize pure attribute

values to comparable service health values which can easily be applied to various rule sets.

Service Health Matrix – Response Tests

Response Time Value     Equivalent Service Health

OK                      UP
TIMEOUT                 CRITICAL
THRESHOLD CRITICAL      CRITICAL
THRESHOLD MAJOR         DEGRADED
THRESHOLD MINOR         SLIGHTLY DEGRADED
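The normalization in the table amounts to a simple lookup. The following Python sketch copies the mapping from the table; the dictionary and function names are assumptions made for illustration and are not SPECTRUM identifiers.

```python
# Mapping copied from the table above; names are illustrative assumptions.
RESPONSE_TO_HEALTH = {
    "OK": "UP",
    "TIMEOUT": "CRITICAL",
    "THRESHOLD CRITICAL": "CRITICAL",
    "THRESHOLD MAJOR": "DEGRADED",
    "THRESHOLD MINOR": "SLIGHTLY DEGRADED",
}


def normalize(latest_error_status):
    """Translate a Latest Error Status value into a service health value."""
    # Ignore case and repeated whitespace before the lookup.
    key = " ".join(latest_error_status.upper().split())
    try:
        return RESPONSE_TO_HEALTH[key]
    except KeyError:
        raise ValueError(f"unmapped status: {latest_error_status!r}") from None
```

Once every attribute value is expressed as a health value, the same rule sets can be applied to devices, processes, and response time tests alike.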

Under some circumstances, documentation might indicate acceptable response time levels.

If this is not the case, a useful approach is to create response time tests without thresholds

and review the latency results over a period of time. This will help you to establish baseline

threshold values to ensure that an unusual latency value would result in a threshold

violation.
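One possible way to turn reviewed latency results into baseline thresholds is to use percentiles of the observed values. The following Python sketch is only an illustration: the percentile choices (90th, 95th, 99th) and the function name are assumptions, and suitable values depend on the service being measured.

```python
import statistics


def baseline_thresholds(latencies_ms):
    """Suggest thresholds from observed latencies (illustrative percentiles)."""
    # quantiles() with n=100 returns the 99 cut points between percentiles.
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "minor": cuts[89],     # 90th percentile
        "major": cuts[94],     # 95th percentile
        "critical": cuts[98],  # 99th percentile
    }
```

Thresholds derived this way sit above typical latency, so only unusual values would cause a violation.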

In Example 2, the Web Service was developed to include a resource monitor for the contact

status of the web server devices and a resource monitor for the condition of the web server

process models. It may be possible to extend the monitoring of the Web Service to include

the response time component by simply adding a third resource monitor which monitors the

response time test models.

When monitoring these response time test models, the following rule set might be

appropriate:

When all resources are down, the service is down.

When any one resource is down, the service is degraded.

When all resources are degraded, the service is degraded.

Consider how this rule set would apply to a set of response time tests as described above.

If all response time tests experienced a timeout or critical threshold violation, it would

indicate that neither web server was capable of responding. Clearly, this is a critical

scenario and should indicate a down service health.

If any one response time test timed out or violated a critical threshold, it would indicate

that one of the web servers was impacted to such a degree that it could not adequately

handle requests. Considering that some of the other response time tests are succeeding,

it can be surmised that the service is not entirely down, but it is degraded.

If none of the tests were timing out or violating a critical threshold, but all were violating

a major threshold, you could assume that the service health is degraded.
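The three rules and the scenarios above can be sketched in Python as follows. The health labels and the function name are assumptions for illustration; in particular, the sketch treats a timeout or critical threshold violation as "DOWN" and a major threshold violation as "DEGRADED".

```python
def monitor_health(test_healths):
    """Apply the rule set above to a list of per-test health values.

    Each value is "UP", "DEGRADED", or "DOWN" (labels assumed for this sketch).
    """
    if not test_healths:
        raise ValueError("at least one response time test is required")
    if all(h == "DOWN" for h in test_healths):
        return "DOWN"          # all resources are down
    if any(h == "DOWN" for h in test_healths):
        return "DEGRADED"      # any one resource is down
    if all(h == "DEGRADED" for h in test_healths):
        return "DEGRADED"      # all resources are degraded
    return "UP"
```

For four tests, a single timeout yields a degraded monitor, while four timeouts yield a down monitor.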




Based on the configuration established above, you could enhance the web service by adding

a new resource monitor for the response time tests. The web service component would

function correctly under any of the scenarios described above. By monitoring its resources

with the Service Health High Sensitivity policy, any resource monitor that is down would

cause the Web Service to go down. Likewise, any resource that is degraded would cause

the Web Service to also degrade. It turns out that the design produced in Example 2 can

easily be extended to include the response time element. The following is an example of what the service hierarchy would look like after the addition of the response time

component to the Web Service.

[Figure: Customer Account Access Service hierarchy. Resource monitors shown: Devices and Processes under each subservice, plus Response Time under the Web Service.]

IMPLEMENT EXAMPLE 3 IN SPECTRUM

For this example, you will create four HTTP response time tests. You can locate response

time test hosts using the Locator tab in the OneClick Console. The Locator menu has a set

of pre-configured SPM Searches.

Note: To run an HTTP test, you must discover test sources such as SystemEDGE Service

Availability agents, Cisco IP SLA-enabled routers, and Network Harmoni agents using

read-write community strings. For details about response testing and supported agents, refer to

the Service Performance Manager User Guide.

To create response time tests

1. Use the All Test Hosts search to locate test host models that can measure HTTP

response time to the web servers. (In the Contents panel, expand SPM Searches and

Test Hosts By; then right-click All Test Hosts and select Launch the selected search.)

2. From each designated test host, create new HTTP tests by right-clicking the host in the

table, choosing New Test, and then selecting HTTP.

3. Specify the threshold data. Configure the thresholds to ensure that a critical threshold

is generated when the response time is too slow to be usable, and a major threshold is

generated when response time is usable but very slow. Add the destination for the test,

which would be one of the web server hosts.

4. To add the response time tests, use the Service Editor to add a new resource monitor,

Web Server Response Monitor, which uses the Response Time High Sensitivity policy.

The resources can be located by expanding SPM Searches. You can then add the four

response time tests to the new resource monitor. Finally, attach this new resource to

the Web Service.




REVIEW EXAMPLE 3

Example 3 describes how you can extend an existing service implementation to include

more sophisticated resource monitoring without altering the service hierarchy. This

flexibility in the service hierarchy design makes it very easy for users to continually

enhance their service models. In addition, Example 3 outlines how to incorporate a

response time component within a service to greatly enhance the accuracy of service health

reporting. Again, this iteration expanded the set of fault scenarios supported and enriched the

set of potential root causes of service impact.

Create SLAs

This section introduces SLA modeling concepts and techniques that you should understand

before modeling SLAs.

Note: The following sections up to and including Example 4 provide instructions for tracking

SLAs based on business hours. This functionality is included in SPECTRUM Release 8.1. For a

complete description of the available capabilities, see the Service Performance Manager

User Guide.

Key Concepts

SLA Periods SLAs consist of a set of service level objectives or guarantees that are

measured for a given period of time. Commonly, this period of time coincides with a well-

defined billing cycle or a reporting cycle. Frequently, an SLA period will be monthly (that

is, the compliance of an SLA is evaluated on a month-to-month basis). Typically, the

compliance or violation of an SLA will be expressed in terms of a particular period. For

example, you might consider an SLA compliant for the month of January. If the period

was weekly, you might consider an SLA violated for the week of November 5-11.

SLA Guarantees or Service Level Objectives Among other stipulations, an SLA will

include a set of guarantees or service level objectives. In particular, many of these

guarantees relate to the availability and performance of a particular service or set of

services. In typical service provider environments, SLAs often state very specific

guarantees. Users may find stipulations similar to the following: “… certifies uptime at

99.9% monthly…” or “…will credit the customer 1/30th of the monthly service fee in the

event that the customer reports a service outage of 30 minutes or more…” These

statements represent guarantees given by the provider of a particular service. Within the

enterprise environment, SLAs also exist, although an enterprise SLA may be less formal.

It is very common to find SLAs such as “…the IT department guarantees no more than 30

minutes of web access down time per week…” In either case, it is these guarantees or

service level objectives which provide the basis for determining SLA compliance with

SPECTRUM.

Active SLA Monitoring Unlike other SLA management products, SPECTRUM Service

Manager provides active SLA management. This means that within a given period, you

are able to determine the status of the SLA for that period. Based on outage trends, you

are provided a projected status for the overall period. At the beginning of each SLA

period, an SLA is considered unaffected. The unaffected status will persist until some

form of outage causes the SLA to record outage time for the period. An SLA which has

recorded outage time, but is not at a significant risk for a violation, is considered to have

a compliant status.

If additional outage time occurs within the period and the outage time accumulates to

levels where the SLA is approaching violation, the SLA will transition to a status of




warned. If outage time for the period continues and specific guarantee thresholds are

reached, the SLA will transition to a status of violated. This transition from unaffected to

a violated state happens in real time. If the SLA period is monthly, and the SLA violates

on the fifth day, you will be aware as soon as the SLA is violated as opposed to waiting

for a report at the end of the period to indicate a violation. Consequently, this active SLA

monitoring allows service providers to take action before the SLA becomes violated.
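The status progression described above can be sketched as a small Python function. The warn_fraction parameter is a hypothetical setting introduced for this illustration; the source does not state how SPECTRUM decides that an SLA is approaching violation.

```python
def sla_status(outage_minutes, allowed_minutes, warn_fraction=0.8):
    """Classify an SLA period from accumulated outage time.

    warn_fraction is a hypothetical setting: the share of the allowed
    outage time at which the SLA is treated as approaching violation.
    """
    if outage_minutes <= 0:
        return "unaffected"    # no outage recorded yet in this period
    if outage_minutes > allowed_minutes:
        return "violated"      # guarantee threshold exceeded
    if outage_minutes >= warn_fraction * allowed_minutes:
        return "warned"        # approaching violation
    return "compliant"         # outage recorded, but not at significant risk
```

Because the status is derived from the outage time accumulated so far, it can be evaluated at any moment within the period rather than only at the end.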

SLA Time of Enforcement or Business Hours Within the SLA period, a particular

guarantee is frequently enforced only at certain times. The SLA may contain statements such as

“…guarantees no outage exceeding 30 minutes between the hours of 8AM and 5PM on

Monday through Friday…” A statement such as this is commonly known as a

business-hour guarantee. Sometimes, multiple guarantees will be based on particular timeframes.

For example, “…guarantees 97.5% availability on a 7x24 basis, with 99.9% availability

between the hours of 8AM to 5PM Monday-Friday, and 8AM to 12 PM on Saturday…”

Although the same service is being measured, this statement actually includes two

guarantees: one guarantee with a 7x24 timeframe, and a second guarantee for specific

hours during the week.

Create SLAs and Guarantees

The first step in creating an SLA is to understand the particular service with which it is

associated and the period during which the SLA will be in effect. The service modeling

hierarchy often has some top-level service model which is logically associated to an SLA.

For example, in the service provider environment, a high-level service such as Customer A

High Speed Data may exist.

Logically, the SLA is a binding of the high-speed data service which is being provided to

Customer A. The particular period may be stipulated in an SLA document or may be

determined arbitrarily, but it must be a timeframe which is agreed upon by both the service

provider and service customer. Monthly SLA periods are very common as they frequently

coincide with a service billing cycle. For example, an SLA period may be in effect from the

first of the month with guarantees based on availability and performance for that month.

Commonly, an SLA will specify restitution guarantees if a customer contacts the service

provider regarding a dispute within a certain number of days from the end of a given

period.

Once the top level service and SLA period have been determined, the user should identify

the SLA guarantees or service level objectives related to the availability and performance of

the service being provided. Often, you can find these guarantees within the SLA document

among other stipulations that are not within the scope of measuring the availability or

performance of a service. You should look for those statements which specify a level of

availability, a guaranteed response time, acceptable level of latency, and so on. In addition,

you should determine if those statements are accompanied by statements that dictate

specific times within the SLA period as to when they are guaranteed.

Having identified the guarantees within an SLA, you should categorize them into availability

and response time guarantees. Availability guarantees within the SLA are frequently

specified as a percentage of availability. However, availability guarantees may be described

in terms of downtime. For example, “…no more than 1 hour of outage time…” Response

time guarantees can be identified either by specific statements such as “…2000 ms or

better response time…” or “…latency not exceeding 5000 ms for more than 30 minutes…”

Availability can be described in a couple of different ways. Previous sections of this

document discussed service health. Typically, you could describe availability as a service




being available when it is not down or a service being unavailable when it is down.

However, an availability guarantee might also be described as a service being unavailable

when it is not responsive. This second description can be very important when building

guarantee models.

Response time guarantees measure services which utilize response time components as

their resources. An interesting point regarding response time guarantees is that

components within a service hierarchy may monitor response time specifically for the

purpose of providing a way to support an SLA’s response-time guarantee. This is actually a

very common scenario.

Frequently, a service hierarchy is built on the foundation of resources that actually comprise

the physical devices and applications providing a user-consumable service. Response-time

tests, despite providing an excellent way to report service health, are often not identified as

service resources until an SLA is applied that stipulates response time guarantees. As

mentioned in previous sections, you can use response-time monitors to identify high

latency or service degradation. The response time tests should report a major threshold

violation when latency exceeds an acceptable level. In addition, you can also use response-

time monitors to report a critical condition when latency reaches an unusable level or

response-time requests time out. When response time monitors are built with this in mind,

they can support both latency monitoring and availability monitoring.

When a service is considered unavailable if it is not responsive, a service designed to

report response time can also be used to measure availability. The reverse does not hold: a

service designed to report only availability cannot support a response time guarantee.

Example 4: An SLA for the Customer Account Access Service

This section contains an example based on the Customer Account Access Service from the

previous section. It includes an SLA and several guarantees.

In the “Creating Service Models” section of this chapter, you implemented the Customer

Account Access Service. For this example, the Customer Account Access service will

represent the service being provided by a fictional company called Northeast Data Solutions

(hereafter, referred to as Northeast).

Northeast maintains customer account information for a large number of small businesses.

Each small business is responsible for creating and maintaining its own customer data.

Northeast takes responsibility for supporting and securing the customer account data. In

addition to supporting the databases and web access, Northeast also negotiates with

various Internet Service Providers (ISPs) to provide a local routing device for the remote

customer site to ensure that customers will have reliable internet access to their customer

account information. The relationship with the ISP is transparent to Northeast customers.

They pay Northeast directly for service.

The following items are segments of an SLA provided to each Northeast Data Solutions

customer:

Northeast Data Solutions provides access to customer account data guaranteeing that

account access for each customer location will be available 99% of each month excluding

those periods of scheduled system maintenance to be conducted between the hours of

12AM to 3AM on each Sunday.




Service availability to be restored such that the average outage resolution time is 30

minutes or less, with no individual outage exceeding 1 hour; outages are guaranteed to

not exceed a rate of two or more outages within any 24-hour period.

A standard business hours timeframe to be defined as the hours of 7AM to 6PM Eastern

Standard Time on the days of Monday through Friday of each week.

Within standard business hours, account access to be guaranteed available at 99.5% with

no individual outage exceeding 20 minutes.

Average transaction time for initial account access is not to exceed five seconds for more

than 5% of the standard business hours timeframe, with successful transaction

completion to be guaranteed at 99% for standard business hours. With a transaction

deemed successful if completed within 15 seconds, no period of transaction failure shall

persist for more than 20 minutes.

Transaction monitoring average based on a sampling of five queries to be delivered

randomly within a five-minute interval during standard business hours, each query

originating from the customer access point device.

Northeast assumes responsibility for an access device assuming the device is operational

with the exception of power failure or an act of nature deemed beyond the control of

Northeast.

In this example, the fictional SLA text includes a variety of guarantee metrics and

terminology that are representative of the statements found within an actual SLA.

Despite its confusing terminology, this SLA actually includes some very precise guarantee

information, including how response time will actually be measured.

This SLA would be provided for each Northeast customer, but this example will focus on the

SLA between Northeast and a customer called A to Z Performance Components, which has

offices in Atlanta and Savannah, Georgia.

As mentioned above, the first step to designing an SLA implementation is to determine

which service supports the SLA and identify the period. The hierarchy below represents the

Customer Account Access Service.

[Figure: Customer Account Access Service hierarchy. Resource monitors shown: Devices and Processes under each subservice, plus Response Time under the Web Service.]

Many components are required to monitor A to Z’s service availability. In addition to

providing web access and database access, Northeast must now build service components

that monitor availability and response time specific to A to Z’s Atlanta and Savannah

offices.

These new service components will monitor access routers at each site and response time

for newly created response time tests that are hosted on the access routers at each site.

The following figure shows how you might extend the hierarchy to support A to Z.




[Figure: A to Z Account Access hierarchy. A to Z Account Access contains A to Z Site Access (Atlanta Routing, Savannah Routing), A to Z Site Response Time (Atlanta Response Time, Savannah Response Time), and Customer Account Access.]

Evaluation of the SLA implies that a variety of guarantees exist and that some new service

models will be required. The chart above represents one possible configuration to use. You

should review each SLA implementation carefully to determine the best way to organize

services.

Among the new services is a hierarchy called A to Z Site Access. A to Z Site Access has two

subservices called Atlanta Routing and Savannah Routing. These services are designed to

monitor the on-site router which provides access to the Customer Account Access Service.

You can break down each one of these subservices into a set of resource monitors,

producing a hierarchy similar to the figure below.

[Figure: Routing service hierarchy with two resource monitors, Router and Interfaces.]

One resource monitor watches the contact status of the router device model, while the

other resource monitor watches the port status of interfaces on the router which are critical

for providing access for the office. The routing service is considered down if the router is

down or if all required interfaces are disabled. A similar service would be implemented for

both Atlanta and Savannah. In reference to the SLA, the following statements are related to

routing components of each site:

Northeast provides access to customer account data guaranteeing that account access for

each customer location will be available 99% of each month excluding those periods of

scheduled system maintenance to be conducted between the hours of 12AM to 3AM on

each Sunday.

Service availability to be restored so that the average outage resolution time is 30

minutes or less, without an individual outage exceeding one hour; outages are

guaranteed to not exceed a rate of two or more outages within any 24-hour period.

Guarantees apply on a per-site basis. Consider a guarantee of 99%

availability for the month of November. November has 43,200 minutes, so 432 minutes of downtime are

allowed. When building the SLA, carefully consider the service or services to which this

should be applied. If this guarantee was applied to the A to Z Site Access service, and

Atlanta experienced 300 minutes of downtime and Savannah experienced 200 minutes of

downtime (for a total of 500 minutes of downtime), the SLA would be violated. However,

the wording in the SLA states “…each customer location…”, so a guarantee should be applied at

each site. By applying the guarantees in this manner, the SLA would not be violated as




neither site experienced more than 432 minutes of downtime. With regard to the

availability of the Routing service, two separate guarantees of 99% apply:

Atlanta availability 99%

Savannah availability 99%
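The downtime arithmetic and the difference between a combined guarantee and per-site guarantees can be checked with a short Python sketch. The function and variable names are assumptions for illustration; the figures come from the scenario described above.

```python
def allowed_downtime_minutes(days_in_period, availability_percent):
    """Downtime budget for a period that is measured around the clock."""
    total_minutes = days_in_period * 24 * 60
    return total_minutes * (100 - availability_percent) / 100


budget = allowed_downtime_minutes(30, 99)        # November has 30 days
site_downtime = {"Atlanta": 300, "Savannah": 200}

# One guarantee on the parent service counts both sites against one budget.
combined_violated = sum(site_downtime.values()) > budget

# One guarantee per site gives each site its own budget.
per_site_violated = {site: minutes > budget
                     for site, minutes in site_downtime.items()}

print(budget, combined_violated, per_site_violated)
```

The budget works out to 432 minutes. The combined total of 500 minutes exceeds it, while neither site exceeds it on its own.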

The SLA also states “…average outage resolution time is 30 minutes or less...” In addition

to the 99% availability guarantee, a supplemental guarantee which specifies an average

outage time of 30 minutes or less is necessary. This component can be added to the

availability guarantees as a Mean-Time-To-Repair (MTTR) supplement. The availability guarantees should now

include the MTTR component:

Atlanta availability 99%, MTTR 30 minutes

Savannah availability 99%, MTTR 30 minutes

In addition to the MTTR component, the SLA states “…outages are guaranteed to not

exceed a rate of 2 or more outages within any 24-hour period…” This statement is referred

to as a Mean-Time-Between-Failures (MTBF) clause. The MTBF clause states that more than

one outage per day cannot occur. The availability guarantees should now include the MTBF

component:

Atlanta availability 99%, MTTR 30 minutes, MTBF 24 hours

Savannah availability 99%, MTTR 30 minutes, MTBF 24 hours
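The supplemental clauses can be evaluated against a list of recorded outages. The following Python sketch is a minimal illustration under stated assumptions: the function name and result keys are hypothetical, the one-hour limit on an individual outage comes from the SLA text, and the 24-hour rule is interpreted as the time between the starts of consecutive outages.

```python
from datetime import datetime, timedelta


def check_outage_clauses(outages,
                         mttr_limit=timedelta(minutes=30),
                         max_outage=timedelta(hours=1),
                         min_spacing=timedelta(hours=24)):
    """outages is a list of (start, end) datetime pairs for one period."""
    outages = sorted(outages)
    durations = [end - start for start, end in outages]
    # Average outage resolution time (zero when there were no outages).
    mean_repair = (sum(durations, timedelta()) / len(durations)
                   if durations else timedelta())
    # Assumed MTBF reading: consecutive outages must start 24 hours apart.
    starts = [start for start, _ in outages]
    spacing_ok = all(later - earlier >= min_spacing
                     for earlier, later in zip(starts, starts[1:]))
    return {
        "mttr_ok": mean_repair <= mttr_limit,
        "max_outage_ok": all(d <= max_outage for d in durations),
        "mtbf_ok": spacing_ok,
    }
```

Two outages of 20 and 40 minutes that start 30 hours apart satisfy all three checks, because the average resolution time is exactly 30 minutes.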

A similar guarantee should be applied to the Customer Account Access Service; however,

this guarantee will be independent of either customer site:



Customer Account Access availability 99%, MTTR 30 minutes, MTBF 24 hours

In addition to the 99% overall availability guarantee, consider these additional availability

specifications:

A standard business hours timeframe is to be defined as the hours of 7AM to 6PM Eastern

Standard Time Monday through Friday of each week.

Within standard business hours, account access will be guaranteed available at 99.5%

with no individual outage exceeding 20 minutes.

Business-hour guarantees can be created by applying a schedule during creation. A weekly

schedule for the days Monday through Friday from 7AM to 6PM will be applied to new

guarantees ensuring 99.5% availability throughout the scheduled period.
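A business-hours guarantee is measured only against the scheduled time, so its downtime budget is much smaller than a 7x24 budget. The following Python sketch estimates that budget for one month. The function name and the choice of November 2007 are assumptions for illustration, and the sketch ignores time zones, holidays, and maintenance windows.

```python
import calendar


def business_minutes(year, month, start_hour=7, end_hour=18):
    """Minutes in the Monday-Friday, 7AM-6PM schedule for one month."""
    _, days = calendar.monthrange(year, month)
    # calendar.weekday() returns 0 for Monday through 6 for Sunday.
    weekdays = sum(1 for day in range(1, days + 1)
                   if calendar.weekday(year, month, day) < 5)
    return weekdays * (end_hour - start_hour) * 60


scheduled = business_minutes(2007, 11)   # example month; the year is an assumption
budget = scheduled * (100 - 99.5) / 100  # downtime allowed at 99.5% availability
print(scheduled, round(budget, 1))
```

For that month, the schedule covers 14,520 minutes, so 99.5% availability allows roughly 72.6 minutes of downtime during business hours.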

The new guarantees should be applied to each customer Routing Service and the Customer

Account Access Service:




Atlanta availability 99.5% M-F 7AM-6PM

Savannah availability 99.5% M-F 7AM-6PM

Customer Account Access availability 99.5% M-F 7AM-6PM




An additional stipulation to the business-hours guarantee must be accounted for: “no

outage can exceed 20 minutes…” This stipulation is referred to as a Maximum Outage Time

(MOT) clause. The business-hours guarantees should also include the MOT component:




Atlanta availability 99.5% M-F 7AM-6PM, MOT 20 minutes

Savannah availability 99.5% M-F 7AM-6PM, MOT 20 minutes

Customer Account Access availability 99.5% M-F 7AM-6PM, MOT 20 minutes

To this point, six different availability guarantees have been identified, but none of these

guarantees account for the response time element within the SLA:

Average transaction time for initial account access is not to exceed five seconds for more

than 5% of the standard business hours timeframe, with successful transaction

completion to be guaranteed at 99% for standard business hours. A transaction will be

deemed successful if completed within 15 seconds, and no period of transaction failure

shall persist for more than 20 minutes.

Transaction monitoring average based on a sampling of five queries to be delivered

randomly within a 5-minute interval during standard business hours, each query

originating from the customer access point device.

The response time stipulations are very thorough and dictate how response time will be

measured. To support this component of the SLA, you need to create two additional

services using the response time tests as monitored resources.

Create an Atlanta Response Time Service to monitor five new response time test models

which will be hosted on the Atlanta access router and run at five-minute intervals. The SLA

specifies 5 seconds and 15 seconds as major and critical thresholds. At first, you might

consider that each response time test (SPM test) should be configured with a 5-second

major threshold and a 15-second critical threshold; however, the SLA wording suggests

that this would not be appropriate. Note the wording “…Average transaction time for initial

account access to not exceed 5 seconds…” The average response time should be monitored,

rather than the individual response time of each test.

Imagine a response time result set of 4 seconds, 3 seconds, 3 seconds, 3 seconds, and 6

seconds. The 6-second result is in violation of the 5-second threshold. However, the

average response time is less than 4 seconds, which is not in violation. To support this

behavior, you should not set thresholds on the individual response time tests. Instead,

create a new Response Time Service with a policy to monitor the latency of the response

time tests. The new service policy should be created to monitor the Latest Result attribute

on the response time test models. The policy should apply an aggregate rule set when

evaluating response times as follows:

When the average for all resources is greater than 15000, the service is down.

When the average for all resources is greater than 5000, the service is degraded.

The Latest Result attribute value of a response time test model is the number of

milliseconds that the most recent test took to complete. Therefore, the values in the service

policy above are expressed in terms of milliseconds. The Atlanta Response Time and

Savannah Response Time Services should both utilize this policy to monitor five response

time tests hosted by the respective site access router.
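The aggregate rule set can be expressed in a few lines of Python. This is a sketch of the logic only; the function name and health labels are assumptions, and the thresholds are the millisecond values from the rule set above.

```python
def response_time_health(latest_results_ms, degraded_ms=5000, down_ms=15000):
    """Aggregate rule set applied to the Latest Result values of the tests."""
    average = sum(latest_results_ms) / len(latest_results_ms)
    if average > down_ms:
        return "DOWN"
    if average > degraded_ms:
        return "DEGRADED"
    return "UP"


# The result set discussed above: one test exceeds 5 seconds,
# but the average is 3800 ms.
print(response_time_health([4000, 3000, 3000, 3000, 6000]))  # prints UP
```

Because the rule evaluates the average, a single slow result does not degrade the service, which matches the wording of the SLA.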




Referring back to the response time specifications in the SLA, two business-hours response

time guarantees are needed to track response time of the services created above:

Atlanta response time 95% M-F 7AM-6PM

Savannah response time 95% M-F 7AM-6PM

An additional availability component is included with the response time stipulation:

…successful transaction completion to be guaranteed at 99% for standard business hours.

A transaction is deemed successful if completed within 15 seconds, and no period of

transaction failure shall persist for more than 20 minutes.

Recalling the second definition for an availability guarantee, “a service is unavailable if it is

not responsive”, the specification in the SLA above requires two additional 99% availability

guarantees with a supplemental MOT component:

Atlanta response time 95% M-F 7AM-6PM

Savannah response time 95% M-F 7AM-6PM

Atlanta availability 99% M-F 7AM-6PM, MOT 20 minutes

Savannah availability 99% M-F 7AM-6PM, MOT 20 minutes

A Maintenance window clause within the SLA should also be considered:

…excluding those periods of scheduled system maintenance to be conducted between the

hours of 12AM to 3AM on each Sunday…

To account for maintenance windows, modify each service to include a maintenance

schedule for that time period.
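The time restrictions in the SLA can be expressed as two simple checks. The sketch below is an assumption-laden illustration rather than SPECTRUM behavior: the function names are hypothetical, and the choice to treat 6PM and 3AM as exclusive end points is an assumption, because the SLA text does not say.

```python
from datetime import datetime, time

BUSINESS_START = time(7, 0)   # 7AM
BUSINESS_END = time(18, 0)    # 6PM (treated as exclusive; an assumption)
MAINT_START = time(0, 0)      # 12AM Sunday
MAINT_END = time(3, 0)        # 3AM Sunday (treated as exclusive; an assumption)


def in_business_hours(ts):
    """True for Monday through Friday, from 7AM up to 6PM."""
    return ts.weekday() < 5 and BUSINESS_START <= ts.time() < BUSINESS_END


def in_maintenance_window(ts):
    """True for Sunday between 12AM and 3AM."""
    return ts.weekday() == 6 and MAINT_START <= ts.time() < MAINT_END


def counts_toward_guarantee(ts, business_hours_only):
    """Decide whether an outage sample at ts counts against a guarantee."""
    if in_maintenance_window(ts):
        return False
    return in_business_hours(ts) if business_hours_only else True
```

A guarantee stated for M-F 7AM-6PM would pass `business_hours_only=True`; the monthly availability guarantees without a business-hours clause would pass `False`.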

The following table lists all SLA components that have been accounted for in this design:

SLA Design Components

SERVICE:                   SLA COMPONENT:

A to Z Account Access      Monthly SLA

Customer Account Access    Availability 99%, MTTR 30 minutes, MTBF 24 hours

                           Availability 99.5% M-F 7AM-6PM, MOT 20 minutes

Atlanta Routing            Availability 99%, MTTR 30 minutes, MTBF 24 hours

Savannah Routing           Availability 99%, MTTR 30 minutes, MTBF 24 hours

Atlanta Response Time      Response time 95% M-F 7AM-6PM

                           Availability 99% M-F 7AM-6PM, MOT 20 minutes

Savannah Response Time     Response time 95% M-F 7AM-6PM

                           Availability 99% M-F 7AM-6PM, MOT 20 minutes




How You Implement the A to Z Account Access SLA in SPECTRUM

To implement the SLA design described in the previous section, follow these high-level

steps:

1. Create the two Routing Services and their resource monitors for Routers and

Interfaces. This results in 6 services grouped into two hierarchies.

› For each site use the Service editor to configure the first resource monitor

(Atlanta Router and Savannah Router) to watch the contact status of the access

router device model with the contact status high-sensitivity policy.

› For each site configure the second resource monitor (Atlanta Interfaces and

Savannah Interfaces) to watch the port status of any critical router interface

using a Port Status policy. Consider using either the Low Sensitivity or

Percentage rule set for the service policy depending on the number of interface

models required to provide access.

› Use the service editor to create the Atlanta Routing and Savannah Routing

Services, which monitor the service health of their two resource monitors

(defined above) using the Service Health High Sensitivity Policy.

2. Create the Atlanta and Savannah Response Time Services and their individual response

time test (SPM) models.

› Use the OneClick Console’s Locater tab to configure 5 SPM test models for each

site with the following settings:

A 5-minute (300 seconds) schedule interval and thresholds disabled

A timeout value of 25-30 seconds

Filter Timeout Data set to FALSE to configure the test models to

have the timeout value written to the Latest Result

› Use the Service Editor to create the Atlanta Response Time and Savannah

Response Time Services, and ensure that they monitor the newly created

response time test (SPM) models.

› Using the Service Policy Editor, configure each Response Time Service to use a

new service policy which monitors the response time test’s Latest Result attribute

and the following aggregate rule set:

When the average for all resources is greater than 15000, the

service is down.

When the average for all resources is greater than 5000, the service

is degraded.




3. Consolidate the Routing, Response time, and the previously created Customer Access

(from example 2) service into higher-level services.

› Consolidate the Routing and Response Time Services at both customer sites

under two new services: A to Z Site Access and A to Z Response Time. The two

new services should monitor their respective components with a Service Health

High Sensitivity policy.

› Create the A to Z Account Access Service to monitor the A to Z Site Access, A to

Z Response Time, and Customer Account Access services. The new service

should monitor its components with a Service Health High Sensitivity policy.

4. Set up the SLA rules.

› After completing the changes made to the service hierarchy, navigate to the SLA

tab within the Service Editor. Create an SLA against the A to Z Account Access

service using a monthly SLA period. Do not at this time create guarantees

against the A to Z Account Access Service.

› Launch the Guarantee Editor with the new SLA highlighted.

› Use the Guarantee Editor to create each of the ten guarantees (8 availability

guarantees and 2 Response Time guarantees) that are identified in the previous

section.

› Apply the MTTR, MTBF and MOT specification to the appropriate guarantees.

Note: The functionality to associate business hours to these guarantees is planned

for SPECTRUM 8.1. Disregard the Business Hour restriction for releases prior to 8.1.

An SLA guarantee will be violated if the following occurs:

The threshold for any guarantee is violated.

The threshold for any supplemental guarantee is violated (that is, MOT, MTTR, MTBF).

When the MOT threshold is violated, the supplemental guarantee will immediately violate

the SLA. If the MTTR or MTBF threshold is violated, the guarantee will transition the SLA to

a state of “at risk” because the final determination of whether MTTR or MTBF has been

violated cannot be made until the end of the SLA period.

The SLA status is equivalent to the status of its worst guarantee. If any guarantee is

violated, the SLA will likewise be violated for the SLA period. When the SLA period rolls

over, the SLA will transition back to a state of “unaffected.”
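The status rules above amount to a "worst status wins" rollup. The following Python sketch shows one way to express them; the enumeration, its ordering, and the function names are assumptions for illustration and do not describe how SPECTRUM stores SLA state.

```python
from enum import IntEnum


class Status(IntEnum):
    # Ordered from best to worst so that max() picks the worst status.
    UNAFFECTED = 0
    AT_RISK = 1
    VIOLATED = 2


def guarantee_status(threshold_violated, mot_violated, mttr_or_mtbf_violated):
    """Status of one guarantee during the current SLA period."""
    if threshold_violated or mot_violated:
        # A guarantee or MOT threshold violation takes effect immediately.
        return Status.VIOLATED
    if mttr_or_mtbf_violated:
        # MTTR and MTBF cannot be finally judged until the period ends.
        return Status.AT_RISK
    return Status.UNAFFECTED


def sla_status(guarantee_statuses):
    """The SLA takes the status of its worst guarantee."""
    return max(guarantee_statuses, default=Status.UNAFFECTED)
```

When the SLA period rolls over, every guarantee would start again from `Status.UNAFFECTED` in this model.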





Service and SLA Reporting

Service Availability and SLA Reports are a major component of the Service Management

solution. These reports complement the service and SLA modeling process and provide

insight into the performance of service components over a variety of time periods. Service

and SLA reports can be categorized into two groupings: customer-facing and internal.

Customer-facing reports provide service availability and SLA status information, and can be

delivered to service customers. Frequently, SLAs will stipulate that customers will receive

Service and SLA reports for each SLA period. Customer-facing reports tend to summarize

status. For example, a customer-facing availability report would only show two metrics:

available time or down time. Likewise, a customer-facing SLA report would only show two

metrics: compliance or violation.

Internal reports are designed to provide a rich set of detailed data for use by the service

provider or enterprise customer. In contrast to the customer-facing service availability

report, an internal service availability report would display maintenance time, loss of

management time, etc. Similarly, an internal SLA report would include all possible SLA

states including unaffected, compliant, warned, and violated.

Other internal reports may summarize services with the greatest downtime or service

resources which contribute the most downtime. Internal reports are intended to provide

insight into the health and performance of Services and SLAs over a period of time.

Run SPECTRUM Service Manager Customer-Facing Reports

Several different customer-facing reports within the Service Availability and SLA category

are available to Service Management users. You use the SPECTRUM Report Manager

application to generate and manage your reports. You can access the application from any

computer that can connect via a web session to the OneClick server on which Report

Manager is installed.

To access Report Manager and run reports

1. Point a web browser to the Report Manager web page using the URL

http://hostname/spectrum/repmgr, where hostname is the name of the OneClick and

Report Manager system.

2. Log in to the application by specifying your username and password in the OneClick

login window.

3. Click the Begin Session link on the Report Manager Welcome window.

The Report Manager main window appears. The main window provides access to all report

and report management options for your account. It lists any scheduled reports that have

been generated for your account, reports that are scheduled to be generated for your

account, and any messages in the Message of the Day and What’s New text boxes posted

by a Report Manager administrator. For a complete description of Report Manager and how

to use it, see the Report Manager User Guide.

The following sections describe the customer-facing reports.




Service Availability by: Name, Customer, Owner

This report includes a pie chart showing service up/down time and availability percentage

based on the period for which the report was run. A table listing all outages shows start time,

end time, duration, and outage notes. In addition, a subreport with detailed outage

information is available for any outage within the table. You can generate multiple service

availability reports by service name, service customer, or service owner.




Service Availability Variable Health Level

This report is similar to the Service Availability report, but allows you to include degraded

and slightly degraded time if you choose. A pie chart including all service health types is

shown with availability percentage calculation based on the period for which the report was

run. All included outages are listed in the subreport showing detailed outage information.
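The effect of including degraded time can be seen with a small calculation. The sketch below is a simplified assumption about how such a percentage could be derived, not the Report Manager formula; the health-level names and the sample durations are made up for illustration.

```python
def availability_percent(seconds_by_health, count_as_outage=("down",)):
    """Percentage of the report period not spent in the selected outage levels."""
    total = sum(seconds_by_health.values())
    if total == 0:
        raise ValueError("report period has no recorded time")
    outage = sum(seconds_by_health.get(level, 0) for level in count_as_outage)
    return 100.0 * (total - outage) / total


# Hypothetical report period of 1000 seconds.
period = {"up": 900, "degraded": 50, "down": 50}
print(availability_percent(period))                        # down time only
print(availability_percent(period, ("down", "degraded")))  # down and degraded
```

With these sample values, counting only down time gives 95 percent availability, and counting degraded time as well lowers it to 90 percent.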




Service Summary by: Name, Customer, Owner

This report lists multiple services based on service name, service customer, or service

owner with outage times and percentage of availability.

Service Summary Variable Health Level

This report provides a table of services with columns that display summarized data for each

service health level that you choose to include in the report, similar to the previous report.

You can choose to display down only; down and degraded; or down, degraded, and slightly

degraded. For each service listed in the table, a subreport with more detailed outage

information is available.




SLA Detail By Customer

You can generate this report for one or more SLA periods. The report includes a pie chart

which displays the percentage of guarantees for all reported periods which are compliant or

violated. Below the chart, each of the guarantees is reported, including the status for each

period. For any period, you can open a subreport showing detailed outage information for

the particular guarantee, including any outage exemptions. If you run the report based on

customer, a separate report will be generated for each of the customer’s SLAs. You can

provide the report to the customer at the end of each period.




SLA Inventory by Customer

This report shows the configuration of each SLA and guarantee for a particular customer.

This is a useful report to generate for a customer when the SLA and guarantee models are

first created. The user should be able to compare the configuration with the SLA document

to verify that all guarantees or service level objectives are addressed.

SPECTRUM Service Manager: Internal Reports

Several different internal reports within the Service Availability and SLA category are

available to Service Management users. The following sections describe the internal reports.

Service Health by Service Name

This report is very similar to a service availability report, but includes all service health

levels including maintenance and loss of management. The report can be run for both

services and resource monitors. A pie chart showing the percentage of each service health

value is shown. A table showing outage of all service health types, including outage notes

and links to detailed outage information, is also included in this report. The service health

report provides service manager users with very detailed information regarding the

performance of a service over a given period of time.




Service Inventory

This report shows a breakdown of all services, resource monitors, and resources which are

modeled in the system. It can be used to preserve a “snapshot” of service inventory

configuration for the current time.




and links to detailed service availability reports are available for each service model. The

report provides very detailed information for the service manager user indicating which

services experienced the most overall outage time for a particular period.

Top N Worst Service Outages

This report allows you to view the top N worst service outages which resulted in service

downtime. This report is a useful tool for summarizing the worst outage for a period of time

and may highlight areas within the service hierarchy which are lacking the redundancy to

prevent service downtime.




Top N Worst Service Resources by Total Downtime

This report shows summarized information regarding the total service downtime caused by

individual resources. This highlights the cumulative effect of each individual resource

outage which results in downtime for one or more services. This report can be an important

tool for identifying service resources which are chronic problem areas within the service

modeling hierarchy.
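The cumulative view that this report provides can be approximated by summing outage durations per resource. The following Python sketch assumes a simple list of (resource, minutes) pairs; the function name and the resource names are hypothetical and do not reflect the Report Manager data model.

```python
from collections import Counter


def top_n_resources_by_downtime(outages, n):
    """Sum downtime per resource and return the n largest totals.

    outages is an iterable of (resource_name, downtime_minutes) pairs.
    """
    totals = Counter()
    for resource, minutes in outages:
        totals[resource] += minutes
    return totals.most_common(n)


# Hypothetical outage records for one reporting period.
outages = [("router-a", 30), ("link-b", 10), ("router-a", 25), ("link-c", 40)]
print(top_n_resources_by_downtime(outages, 2))
```

In this sample, two moderate outages on the same resource outweigh the single longest outage, which is the kind of chronic problem the report is meant to expose.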

SLA Status Current and Recent by Customer

This report provides you with a quick way of obtaining summarized SLA status for the

current and recent periods. Status includes unaffected, compliant, warned, and violated

SLAs with detailed subreports showing results for specific guarantees. This report can be

run for selected SLAs or SLAs for a specific customer. This report can provide a quick

review of the status of many SLAs for any customer.




SLA Summary by: Name, Customer, Status

This report produces a table of summarized SLA status for one or more periods. The report

can be generated by SLA name, customer name, or simply organized by status. The report

provides a summarized reference for multiple SLAs or multiple periods. You can access

detailed subreports showing results for specific guarantees.

SLA Summary Warned or Violated

This report produces a table of all SLAs that are currently in the warned or violated state.

The table also provides access to a subreport showing detailed guarantee outage

information for the current period. This is a useful report for the service manager user to

view SLAs that are not performing well for the current period.




SLA Detail By: SLA Name, Time Range, Last N Periods

This report is similar to the SLA Detail By Customer report except that it displays all

SLA Status values including unaffected, compliant, warned, and violated states. Detailed

information is provided in a subreport which includes guarantee outages for the particular

period. This is useful for obtaining detailed information about individual SLAs for one or

more periods.




PERIOD DETAIL SUBREPORTS




SLA Detail with Resource Outages

This is a complex report that brings together SLA status and the associated resource

outages which ultimately impacted the SLA for a specific period. This report is useful when

used in conjunction with the Top N Worst Service Resources by Total Downtime report. You can

use this report to show the impact of a particular resource at a very high level. Because it

provides a great deal of information, it may generate many pages of data for SLAs with a

high number of resource outages.




Customer SLA Summary

This report shows the status of the last six SLA periods for all customers’ SLAs. The status

includes all four values. For each SLA, a chart summarizing six periods of status information

is shown within a table providing summarized information for each period and a link to

more detailed guarantee outage information. This report provides service managers with a

quick view of SLA performance for a specific customer over the last six periods. The report

may also be used by the sales organization to verify if a customer’s SLAs have been met for

recent periods.




Chapter 8: Proactive Service Assurance

The CA Network and Voice Management solution has embedded algorithms that help

operations teams to identify growing problem areas within the infrastructure before they

impact customer service. Problems rarely occur instantaneously; often, warning signs occur

such as subtle but growing service degradations, increasing errors, and delays. These

problems might not be serious enough for users to notice or to warrant the opening of

service calls, but they are growing.

With tools that can analyze and detect growing problems and raise warnings, operations

teams can proactively fix the problems before they result in outages and interrupted or lost

service. This capability is particularly important for SLA enforcement. If you can resolve SLA

troubles before the SLA is violated — and without requiring additional network resources or

servers — you can avoid excessive effort and expense.

Prerequisites: The procedures in this chapter assume that you have installed SPECTRUM

and eHealth, and that you have configured Live Health to send traps to SPECTRUM. For

more information about configuring the integrated product solution, see Chapter 5. For

details on the Live Health application and how to create monitoring rules, see the Live

Health web help. The eHealth web help is installed on the eHealth system and is also

available on the TotalDoc online documentation CD.

How You Identify Potential Problems

The proactive analysis of the eHealth Live Health application and the Health report

exceptions analysis are the key tools that warn you about growing problems in your

network. For converged networks, the eHealth for Voice Policy Manager identifies when

voice and messaging problems are starting. All of these tools provide configurable

thresholds and settings so that you can define when a problem is serious enough to merit

proactive attention.

You can configure these tools to automatically watch for these growing problems and send

alarms to SPECTRUM when the problems require attention. In addition, you can define how




long the behavior must be occurring before alarms are raised so that you can reduce the

“false” alarms of simple threshold violations and focus on the real, continuing situations.

For example, the Live Exceptions application of the Live Health product family provides

notifications of potential delay, failure, and unusual workload problems within networks,

systems, and applications. It uses the historical data that eHealth gathers and maintains to

assess potential problems over time. When Live Exceptions detects a condition that merits

operator attention, it raises an alarm and sends it to SPECTRUM.

Configure Live Health to Watch for Growing Problems

For proactive service assurance, you use the Time over Threshold and Deviation from

Normal algorithms within Live Exceptions to watch for growing problems in service. When

performance changes from what is considered “normal” behavior (based on past history) for

a particular length of time, Live Health raises an alarm and can send that alarm to

SPECTRUM.
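The general idea behind the two alarm types can be sketched in a few lines of Python. This is not the Live Exceptions implementation: the function names, the sample-window approach, and the use of a standard-deviation tolerance are all assumptions chosen only to illustrate how sustained behavior differs from a single threshold crossing.

```python
from statistics import mean, pstdev


def time_over_threshold(samples, threshold, window, min_over):
    """True when at least min_over of the last `window` samples exceed threshold."""
    recent = samples[-window:]
    over = sum(1 for value in recent if value > threshold)
    return len(recent) == window and over >= min_over


def deviates_from_normal(history, latest, tolerance=2.0):
    """True when latest differs from the historical mean by more than
    `tolerance` population standard deviations (an illustrative baseline)."""
    baseline = mean(history)
    spread = pstdev(history)
    return abs(latest - baseline) > tolerance * spread
```

Requiring several samples over the threshold within a window is one way to suppress the short spikes that a simple threshold check would report as alarms.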

To configure Live Health for proactive service assurance

1. Use the Live Exceptions Browser to associate the applicable Unusual Workload default

profiles to groups or group lists of your managed resources. Use the Live Health profile

descriptions tool on http://support.concord.com/devices to identify the correct profiles

for the element types that you have discovered. For more information about associating

profiles to groups, see Chapter 5.

2. If you have custom SLAs, you can create custom profiles with Time over Threshold and

Deviation from Normal alarms to reflect your service thresholds. Make sure that your

rules are configured to warn you when the service degradations require attention,

which will typically be at a threshold that is lower than your service agreement

thresholds. For instructions on creating custom profiles, see the Live Health web help.

Configure Health Reports to Send Traps for Growing Problems

The Exceptions section of a Health report contains information about elements that have

experienced unusual events or that may not have sufficient resources to accommodate the

demand that is placed on them. This section of a Health report identifies elements that

have accumulated a high number of exception points as the result of errors, high utilization

and divergence from trends. Elements appear in the report only when their accumulated

exception points exceed a minimum number. eHealth administrators can specify this

number in the service profile for the report.

As an additional means to proactively monitor service, you can configure Health reports to

forward traps for Health exceptions to the SpectroSERVER. When the scheduled Health

report runs, eHealth sends an SNMP trap to the SpectroSERVER for the leading problem for

each element in the Exceptions section of the Health report. Trap-forwarding is not enabled

by default for eHealth; you must create a custom Health report to enable this feature, and

then schedule that Health report to run automatically.

Note: Only scheduled Health reports forward exceptions. If you manually run a Health

report, it will not forward exceptions.

For instructions on creating a custom Health report that forwards exceptions as traps to

Live Health, see Chapter 5.




Send Voice Alerts to SPECTRUM

eHealth for Voice Release 4.0 provides alarm integration and correlation with SPECTRUM,

taking advantage of the SPECTRUM Service Management and voice modeling capability. You

can configure the eHealth for Voice Policy Manager application to send SNMP traps to

SPECTRUM when violations of QoS or GoS occur. SPECTRUM applies its intelligence on

policy, models, and rules to identify the severity of the problem.

Policy Manager monitors all data, voice, and PBX activity, and then reviews that data against

pre-defined criteria. With Policy Manager, you can define rules or policies against any data

— configuration changes, system traffic, individual usage, alarms, historical events, and so

on. For instructions on configuring eHealth for Voice to send performance traps to

SPECTRUM, see Chapter 5.

How You Respond to Alarm Actions in SPECTRUM

Using the SPECTRUM OneClick console, network operators and managers can view the

models (or resources) in their topology and watch for events or status changes that indicate

growing problems in their network. When SPECTRUM receives a trap from eHealth, the

model that represents the element changes color to represent the alarm severity of the trap

that was received. For example, when critical problems occur, the device icon changes to

red, while minor problems cause it to change to yellow.

Operators can right-click the icon and take the following actions to troubleshoot or

investigate the problems:

Drill down to an eHealth Alarm Detail report to obtain a picture of the performance trends

that caused Live Health to detect performance problems that required an alarm to be

raised. For example, if a device has performed outside its normal operating thresholds for

more than 15 minutes, Live Health Alarm Detail reports can show you the performance

trend line for the element.

Run an At-a-Glance or Trend eHealth report to review the performance history of the

resource. While a Trend report shows you the performance of the specific problem

variable, the At-a-Glance shows you a set of common performance variables for that

element type. Using this data, you can identify contributing causes or the root of the

problem.

Clear the alarm. If the operator knows that the alarm is related to a known problem or

situation, the operator could clear the alarm and return the device status to normal.

Open Service Desk tickets to record the problem as a work task and assign it to

personnel to fix. With the Unicenter Service Desk integration, SPECTRUM can open,

update, and close Service Desk tickets that track work to address problems in the

network. Operators at the OneClick console can drill down to the Service Desk ticket

details to determine the latest status and assigned troubleshooter for the tickets.




Chapter 9: Predictive Capacity Planning

Employee productivity and customer satisfaction both depend on the availability and

performance of mission-critical applications. The applications depend on the IT

infrastructure running smoothly and efficiently.

Ensuring that IT resources meet the needs of your users requires more than just

responding to problems. To keep your infrastructure running efficiently, you must obtain

real-life data about the current status of your network, identify congestion and trouble

spots before they affect users, and plan effectively for the future. These tasks are all part of

predictive capacity planning.

Capacity planning is a complex and critical part of managing IT resources. It helps you to

use your current resources efficiently, evaluate trends in demand, and project future

resource needs. Effective capacity planning allows you to achieve the following:

Reduce costs through the reduction or elimination of underused leased lines.

Improve performance through identification of both overused and underused elements,

and rebalancing of capacity with demand.

Reduce server and network downtime by anticipating overloads before they occur, and

ensuring adequate capacity is in place.

Improve budget predictability by tracking trends and modeling the effects of new services

or infrastructure, allowing you to avoid emergency purchases and ensure you get the best

prices.

This chapter describes how eHealth can help you to perform four major capacity planning

tasks:

Identify underutilized resources to find existing devices or resources that are underused,

resulting in unnecessary costs for leased lines and systems that are sitting idle.

Identify overutilized resources to find existing devices or resources that are overused,

resulting in performance degradation or penalty charges from overuse.

Plan future capacity needs to project capacity needs based on current demand trends or

anticipated business changes, allowing you to plan purchases and install upgrades as

needed.

Perform voice capacity planning to find over- and underutilization problems in your Telco

or converged networks.

Prerequisites: To use the best practices in this chapter, your eHealth database must have

at least a week of collected data. With more performance data and longer history, these

reports perform better for highlighting capacity trends and utilization problems.

These examples also assume that you are viewing the reports from the eHealth Web

interface. Reports on the Web interface have interactive “hot-spots,” which you can click to

drill down to other reports and closer detail. Drilldowns are not available from reports that

are in PDF format.




Additional References: Procedures for running, scheduling, and customizing reports are

described in detail in the eHealth Report Management Guide and the eHealth web help. For

details on the eHealth reports and how they work, see the eHealth web help. The best

practices in this section are taken from the Capacity Planning with eHealth topic, which is

available on the eHealth Support web site at http://support.concord.com.

How You Identify Underutilized Resources

To identify underused resources, follow these steps:

1. Locate underutilized resources.

2. Confirm underutilization.

3. Address underutilized resources.

4. Show ROI.

5. Update your configuration.

Locate Underutilized Resources

eHealth provides an Underutilized Elements report that allows you to quickly identify

elements that may be underutilized. This report is an optional Supplemental report in

eHealth Health reports. To view this report, you must customize a Health report to include

it.

To locate the underutilized elements in your network

1. Log in to the eHealth Web interface, and select the Run Reports tab.

2. Click the Standard Health report link. The Run Health Report page appears.

3. Specify the report subjects (for example, LAN/WAN technology, and a group of

elements).

4. Click More Options, and select Supplemental under Presentation Attributes.

5. In the list of Supplemental reports, select Underutilized Elements to include that report.

6. Save the report with a unique name, such as Underutilization_Report.

7. Click Generate Report to run the report. Because this report can take several minutes

to run on demand, the recommended best practice is to schedule the report to run from

the eHealth console so that the report runs overnight or during a time when the

eHealth system is not very busy.

8. Review the Underutilized Elements supplemental report. The report lists elements that

meet the following criteria for the past 8 days:

› Never reached 50% utilization

› Did not reach 10% utilization more than 5% of the time




9. In the report, look for leased lines, routers, switches, and systems that have

underutilized bandwidth, CPU capacity, memory, or disk space. For example, the

following report shows several high-speed OC-3 lines that have very low usage and

should be investigated further.

When you run an Underutilized Elements report for only LAN/WAN elements, the

elements are sorted first by speed (since faster WAN links are more expensive), and

then by the percentage of time that they were underutilized.
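The two criteria listed in step 8 can be applied to a series of utilization samples as follows. This Python sketch is only an interpretation of those criteria: the function name, the parameter names, and the assumption that samples are evenly spaced percentages over the 8-day period are not taken from eHealth.

```python
def is_underutilized(utilization_samples, peak_limit=50.0, low_mark=10.0,
                     low_share=0.05):
    """Apply the two report criteria to percent-utilization samples.

    An element qualifies when it never reached peak_limit and reached
    low_mark in no more than low_share of the samples.
    """
    if not utilization_samples:
        return False
    never_reached_peak = max(utilization_samples) < peak_limit
    at_or_above_low = sum(1 for u in utilization_samples if u >= low_mark)
    rarely_reached_low = at_or_above_low / len(utilization_samples) <= low_share
    return never_reached_peak and rarely_reached_low
```

An element that idles at 2% and touches 12% in only a few samples would qualify, while a single sample at 55% would exclude it.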

BEST PRACTICES

As you use the Underutilized Elements report, consider the following best practices that can

help to make the report more meaningful for your environment:

When you first install eHealth, run it weekly to identify resources that are not being used.

After this initial period, you can run it less frequently (monthly or quarterly) to identify

usage changes in your network.

Since the Underutilized Elements report looks at data from the past 8 days, you should

schedule the report to run on Sunday so that you get data for an entire business week.

Depending on how your network is used, you can edit the service profile so that the

report includes data from only certain days or times, to eliminate periods of low network

usage such as nights or weekends.

Confirm Underutilization

After you find underutilized elements, analyze the purpose of each element and run reports

to confirm that it is actually underused. Run a monthly Health Report to confirm that it has

been underutilized for at least a month.

Important: Check unused network links to determine if they are backups. Since backups

are used only when the primary fails, they often do not have any usage.




To confirm that an element is underutilized

1. In the Health report (such as the one from the previous section), examine the

Bandwidth Utilization chart on the Element Detail page of the report.

The Bandwidth Utilization chart shows the load on each of the network interfaces over

the report period. For example, the bar for Helium 5734 is completely gray, indicating

that it did not have any usage during the month. Several other bars, such as Helium

7839, Miami, and Atlanta are all dark green, indicating they never exceeded 10%

usage. All of these interfaces appear underutilized.

2. Run a Bandwidth Trend Report by clicking the bar for the element that you suspect is

underused. The Bandwidth Trend Report shows the utilization for that element for the

same time period as the Health Report.




                   Net-Link     Chicago T1    Paxton T1
Current Speed      100 Mbps     1.54 Mbps     1.54 Mbps
New Speed          10 Mbps      512 Kbps      128 Kbps
Current Cost       $3,300       $2,500        $5,000
New Cost           $2,500       $2,000        $3,200
Switching Cost     $5,000       $1,000        $1,200
Monthly Savings    $800         $500          $1,800

4. Based on your ROI calculations, determine whether making the proposed changes

makes sense. For example, the table shows that downgrading the Net-Link from a 100 Mbps line to a 10 Mbps line would save $800 each month, but the high switching cost

means that you would not break even for over six months. Downgrading the Paxton T1

to a 128 Kbps line, on the other hand, would give you an ROI in less than one month.
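The break-even reasoning in this step can be checked with a short calculation. The sketch below divides each link's one-time switching cost by its monthly savings, using the figures from the table above. The function name is illustrative.

```python
# Figures from the table above: (switching cost, monthly savings) in dollars
links = {
    "Net-Link":   (5000, 800),
    "Chicago T1": (1000, 500),
    "Paxton T1":  (1200, 1800),
}

def break_even_months(switching_cost, monthly_savings):
    """Months of savings needed to recover the one-time switching cost."""
    if monthly_savings <= 0:
        raise ValueError("monthly savings must be positive")
    return switching_cost / monthly_savings

payback = {name: break_even_months(*costs) for name, costs in links.items()}
for name, months in payback.items():
    print(f"{name}: {months:.2f} months")
```

The Net-Link result is a little over six months and the Paxton T1 result is under one month, which matches the discussion above.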

Update Your Configuration

After you change your configuration to resolve capacity issues, update your SPECTRUM and

eHealth environments to ensure that they reflect updated speeds and perhaps any

resources that have been retired.

To update your configuration

1. Update your SPECTRUM views using rediscovery to ensure that they reflect the latest device information.

2. Update the eHealth polling configuration and element lists by re-importing the element

information from SPECTRUM and rediscovering your elements. Future reports for the

time ranges when element speeds changed may show unusual utilization percentages.

3. If you decreased capacity or added demand to existing resources, run Trend and Health

reports on those resources. Look for any Health exceptions or other utilization problems

that may result from the increased traffic.

4. If you eliminated an element, disable polling and retire the element in the eHealth

database. Retiring the element allows you to continue reporting on it until its data ages

out of the database.




How You Identify Overutilized Resources

To identify overutilized resources, follow these steps:

1. Locate overutilized resources.

2. Confirm overutilization.

3. Address overutilized resources.

Locate Overutilized Resources

eHealth’s capacity planning tools can help you to identify overutilized resources before they

start causing problems. By examining a single Health Report each week, you can identify

network elements that are reaching their capacity. You can then consult other reports to

analyze problems, and solve the issues before they become fires for your IT team.

To locate overutilized resources

1. On the eHealth web interface, run a daily Health report for the busiest day of your

week.

2. In the left pane of the Health report window, click Exceptions Summary to open the

Exceptions Summary Report. The Exceptions Summary report identifies elements that

have experienced unusual events or whose resources are consistently inadequate for

the demand on them. The elements are ranked by exception points, so that those

elements experiencing the worst problems are listed first.

3. Look for elements in the report that list Utilization Health Index or Congestion Health

Index in the Leading Exception column. These elements are experiencing high volume

and may be overutilized.

For example, the Frame Relay link to the Virginia office is listed first, and has Utilization

Health Index as its leading exception. This link is likely overutilized and should be

investigated further.

Confirm Overutilized Resources

The Situations to Watch chart identifies elements that are predicted to exceed, reach, or

come close to reaching their trend thresholds. The chart shows you how close each element

is to its threshold, how fast utilization is growing, and how long until demand exceeds

capacity.




To confirm that resources are overutilized

1. On the eHealth web interface, run a Health report for the previous week.

2. Review the Situations to Watch chart in the Summary section of the Health report.

3. Review the elements listed in the chart, looking for those that have exceeded their

threshold or are growing fast enough to soon reach it. For example, the first element

(Virginia) has already exceeded its threshold for two days, while the next two are

predicted to reach threshold in the next week. All of these are likely to be overutilized

elements. Demand on the final two elements listed is increasing, but both are still at

less than 20% capacity, and do not represent a problem.

4. Select Element Detail in the Health Report, and examine the Bandwidth Utilization chart

for the elements that you suspect to be overutilized.

The Bandwidth Utilization chart shows the percentage of time that each element was in

each usage range. Generally, purple and red colors indicate an overutilized resource.

Purple indicates greater than 100% utilization, meaning that the element is probably a

leased line exceeding its contracted bandwidth, and, therefore, incurring overage

charges.

5. Examine the chart to see how often a suspected element was overutilized during the

course of the week. Some elements may show consistently high demand (such as the

Virginia line), but since demand varies over time, most elements will show significant

periods of low usage. Depending on your network activity, an element may not have

any usage at certain times (overnight for example), but still be overutilized because

demand exceeds capacity at peak times.




For example, the Vermont line in the chart does not show any usage a third of the time

(possibly overnight), and is under 20% usage most of the time. However, since it

exceeds 100% utilization at peak demand, it could be incurring overage charges, and,

therefore, be considered overutilized.

6. To obtain more details about an element’s performance, create an At-a-Glance report

by clicking that element in the Bandwidth Utilization chart.

7. Review the Bandwidth Utilization charts in the At-a-Glance report to determine how

frequently the element was overused, and during what time periods. The sample charts

show that the element had 50% utilization most of the week, but peaked near 100%

several times. Depending on your business needs, an element that reaches its capacity

for only one hour per week may be acceptable, or that one hour of overutilization could

be a critical problem if it occurs at a key business time.

8. Review the other charts in the At-a-Glance report for any anomalies, including high

error rates or signs of congestion (forward explicit congestion notifications (FECNs),

backward explicit congestion notifications (BECNs), discards). Use this information to determine the conditions that might be affecting the element, such as the following:

› Insufficient capacity

› Inefficient or misconfigured applications consuming excessive bandwidth

› Too many or too few stations overloading a WAN link

› A highly repeated or bridged domain that should be routed

9. Establish a report trail to document evidence of high usage. In addition to the reports

described here, you can select specific elements in the Exceptions Summary Report and

Situations to Watch chart to run detail reports for those elements. You can also run

Bandwidth Trend reports on specific elements to show the long-term utilization of a

resource.

How You Address Overutilized Resources

After you have identified and documented overused resources, consider taking these

typical actions to resolve the problem:

Upgrade the element to a higher capacity.

Relocate demand to other resources.




Add additional elements to share the workload.

BEST PRACTICE

Use Capacity Trend What-If reports to visualize the effects of higher capacity or lower

demand on the overutilized element, and determine the optimal capacity of any new

resources. When you run the report, you specify an element, a capacity variable, and a

time range. The report shows the value of that performance variable during that historical range.

The What-If report is very similar to the eHealth Trend report; however, you can change

the capacity of the resource, the demand placed on the resource during that time, or both;

and then update the report to model the effects of possible changes.

Note: When you enter values for capacity and demand, note that you must specify

percentage values. For example, 100% causes the report to use the current values; 50%

causes the report to show half the current values (dividing the capacity or demand by 2);

and 200% causes the report to double the current values.
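The percentage semantics in the note can be illustrated with a small calculation. This is a minimal sketch of the scaling arithmetic only. The function name, parameters, and sample link are assumptions, and the actual What-If report may compute its values differently.

```python
def what_if_utilization(demand, capacity, demand_pct=100.0, capacity_pct=100.0):
    """Utilization (%) after scaling demand and capacity by What-If percentages."""
    if capacity <= 0 or capacity_pct <= 0:
        raise ValueError("capacity and capacity_pct must be positive")
    scaled_demand = demand * demand_pct / 100.0        # 100% keeps the current value
    scaled_capacity = capacity * capacity_pct / 100.0  # 200% doubles, 50% halves
    return 100.0 * scaled_demand / scaled_capacity

# Hypothetical link: 1.2 Mbps peak demand on a 1.5 Mbps line
baseline = what_if_utilization(1.2, 1.5)                       # current values
doubled = what_if_utilization(1.2, 1.5, capacity_pct=200.0)    # capacity doubled
halved_demand = what_if_utilization(1.2, 1.5, demand_pct=50.0)
```

Doubling capacity and halving demand have the same effect on the utilization percentage, which is why the report lets you vary either value or both.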

This report shows that by increasing the capacity of the Virginia line by 25% (capacity = 125%), peak utilization would be reduced to about 60% of capacity. This capacity should be

sufficient to meet expected demand.




How You Plan Future Capacity Changes

To plan and predict capacity changes, follow these steps:

1. Identify potential capacity changes.

2. Analyze capacity trends.

3. Visualize capacity changes.

4. Address capacity changes.

Identify Potential Capacity Changes

eHealth provides capacity planning reports that enable you to analyze the behavior of your

resources under varying conditions, and predict where and when you’ll need to add

capacity.

To identify potential capacity issues

1. Schedule a Health report to run every Sunday to ensure that you obtain data for an

entire business week. For instructions, refer to the section on customizing and

scheduling Health reports in Chapter 5 of this guide.

2. Examine the Situations to Watch chart in the Summary section of the Health report.

The Situations to Watch chart shows the top 10 elements (network interfaces, CPUs,

disk partitions) that are nearing their capacity. The chart shows how close each

element is to its threshold, how fast utilization is growing, and how long until demand

exceeds capacity.

This report shows several user partitions that are nearing their thresholds. In the Days

To Threshold column, System-Orange shows 0, meaning that utilization has reached

the Trend threshold. System-Green shows 20 days to threshold, and System-Pink shows Increasing, indicating utilization is growing, but will not reach threshold for a

long period of time.

Each of the systems at or near its threshold merits further investigation. For example,

System-Orange could already be overutilized, or it could be a system partition designed

to operate near capacity. System-Green, on the other hand, is 20 days from meeting

its threshold, but could be a good candidate for upgrade if it is showing a steady

increase in demand.




3. To drill down to more information for each reported situation, click the element name to

run a Situations to Watch Detail report for the partition.

4. Examine the trend line to see how quickly the trend is approaching the threshold. If the

line is rising at a steady rate, as in this example, consider adjusting capacity by

increasing the size of the partition, deleting unneeded directories and files, or buying a

new system.
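To make the Days To Threshold idea concrete, the sketch below fits a straight line to daily utilization samples and extrapolates to the threshold. This guide does not describe how eHealth computes its trend, so the least-squares fit, the function name, and the sample history are all assumptions.

```python
def days_to_threshold(daily_utilization, threshold):
    """Estimate days until a least-squares trend line reaches the threshold.

    Returns 0 if the latest trend value is already at or above the threshold,
    and None if the trend is flat or falling.
    """
    n = len(daily_utilization)
    if n < 2:
        raise ValueError("need at least two samples")
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(daily_utilization) / n
    slope = (
        sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_utilization))
        / sum((x - mean_x) ** 2 for x in xs)
    )
    intercept = mean_y - slope * mean_x
    current = intercept + slope * (n - 1)  # trend value on the latest day
    if current >= threshold:
        return 0
    if slope <= 0:
        return None
    return (threshold - current) / slope

# Hypothetical partition utilization (%), growing one point per day
history = [60, 61, 62, 63, 64, 65, 66]
```

A steadily rising line like this one yields a finite estimate, while a flat or falling trend yields no estimate at all.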

Analyze Capacity Trends

After identifying potential upgrade candidates, run Capacity Projection and Capacity Provisioning reports to forecast volume changes over the upcoming weeks and months, and

predict when elements need to be upgraded.

To run Capacity Projection and Capacity Provisioning reports

1. Log in to the eHealth Web interface, and select the Run Reports tab.

2. Click the Standard Health report link. The Run Health Report page appears.

3. Specify the report subjects (for example, System technology, and a group of

elements).

4. Click More Options, and do the following:

a. Under Presentation Attributes, select Capacity.

b. Select Capacity Projection and Capacity Provisioning to include those reports.

c. Specify 20 in the Capacity Provisioning Minimum Lead-Time field.

d. Specify 90 in the Capacity Provisioning Maximum Lead-Time field.




5. Save the report as a template with a unique name, such as Capacity_Report.

6. Click Generate Report to run the report. The report can take a few minutes to run on

demand. As a best practice, you can schedule the report to run from the eHealth

console so that the report is automatically generated during off-peak hours and is

ready for review when you need it.

7. Review the Capacity Projection report. The report forecasts how the capacity of a

particular variable (partition utilization, for example) will change in the future. You can

run the report based on peak, average, or percentile capacity values. eHealth measures

the predicted capacity values against a threshold that you specify, and displays those

elements predicted to exceed the threshold.

This report displays the percentage of partition capacity that will be consumed on each

system at 30 days, 90 days, and nine months into the future. You can see that demand

on System-Orange is near threshold, but not increasing very much. Demand on

System-Purple, however, is quickly increasing and will soon exceed capacity.

System-Purple, therefore, may be in greatest need of upgrade.

8. To project when these elements will need to be upgraded, review the Capacity Provisioning report. The Capacity Provisioning report compares projected capacity

values against an upgrade threshold, and displays those elements predicted to exceed

the threshold, along with the number of days until an upgrade is required.

As with the Capacity Projection report, you can run this report based on peak, average, or

percentile capacity values. You can set both the upgrade threshold and an upgrade

lead-time window by customizing the Presentation Attributes for the Health report.

The report shows elements that are predicted to meet a 90% capacity upgrade point

within the next 20 to 90 days. System-Green is most in need of upgrade, and should be

addressed in the next 20 days.




BEST PRACTICES

For the Capacity Projection and Capacity Provisioning reports, it is important to know how much lead

time you need to bring new capacity online. For example, some service providers may

require 90 days to provide a new T1 line. For systems, you might need 30 days to order

and add new disk space or memory. Therefore, for the types of resources you manage, you

need to know when you must order additional capacity so that it is available (installed, tested, and turned over) before the upgrade point is reached.

These reports identify those locations for which additional capacity needs to be ordered

today to avoid reaching the threshold. The examples in this section show a 20-90 day lead

time, and all three locations are projected to require an upgrade within that window. If it

takes 90 days to add disk space or memory, it is very likely that the 90% upgrade

threshold will be violated during this time period. When you first start to use eHealth to

monitor your resources, you may find that some of your resources need upgrades sooner

than your lead times might allow; but over time, these reports will help you to isolate

problems earlier and avoid threshold violations before your lead time windows expire.
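The lead-time reasoning above reduces to simple date arithmetic. The sketch below is illustrative: the function names and the 45-day example are assumptions, while the 30-day and 90-day lead times echo the examples in this section.

```python
from datetime import date, timedelta

def order_deadline(today, days_to_upgrade_point, lead_time_days):
    """Latest date to order capacity so it is online before the upgrade point."""
    return today + timedelta(days=days_to_upgrade_point - lead_time_days)

def is_late(days_to_upgrade_point, lead_time_days):
    """True when the upgrade point arrives before new capacity could be ready."""
    return days_to_upgrade_point < lead_time_days

# Hypothetical element whose upgrade point is 45 days away
today = date(2024, 1, 1)
deadline_30 = order_deadline(today, 45, 30)  # 30-day lead time for disk space
late_90 = is_late(45, 90)                    # 90-day lead time for a new T1 line
```

With a 30-day lead time there is still a 15-day margin before the order must be placed. With a 90-day lead time the threshold will be reached before the capacity arrives, which is the situation the reports help you avoid over time.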

Visualize Capacity Changes

The Capacity Trend What-If report shows how resources perform as your infrastructure

changes and grows. These reports allow you to leverage historical data to predict future

patterns, model changes in capacity or demand, and determine the effect on resources.

To help visualize the impact of changes in demand

1. Run a Capacity Trend What-If report to analyze potential solutions:

a. On the Run Reports page in the Available Reports column under What-If, select

CapacityTrend or another template name. The Run a Capacity Trend What-If

Report page appears.

b. Select an element type from the Element Type list, and then select an element

from the Available elements list.

c. Select a variable for your report.

d. Under Chart type, select the chart format.

e. Under Divide by, specify how you want to graph the selected variable.

f. Optionally, select a time interval during which the data is aggregated.

g. Select a sample size based on the time range for your report. The As Is sample size

uses the most granular data available and does not aggregate the values.

h. Under Report Time, select the report period. You can specify the values now, today, or yesterday, or an actual date or time value.

i. Select More Options to specify the hours and days that the report will show.

j. Optionally, customize the report by setting presentation attributes.

k. Click Generate Report.




2. Use the fields at the top of the report to adjust the capacity and/or demand for the

resource, and run the report again to model the change.

3. Use the report to model and determine whether an existing resource can support

anticipated changes and, if not, how much capacity must be added. You can also

illustrate potential problems so that you can propose requests for new equipment or

upgrades.

For example, this report shows that by doubling CPU capacity, demand on the server

will be well under the trend threshold of 80%, even with a 50% increase in demand.

How You Address Capacity Changes

After you have identified possible capacity issues or improvements, consider these typical

actions to resolve the problem:

Upgrade the element to a higher capacity

Replace the element with a larger or faster device (such as a larger disk, faster interface, or faster CPU system).

The What-If report can help you to model, or visualize, how the proposed changes will

improve performance. After making any changes to your devices or resources, update your

configuration as described in Chapter 5.




Voice Capacity Planning

For networks with traditional or IP voice telephony devices or voice messaging systems,

eHealth for Voice can help you to identify capacity problems and monitor GoS during the

peak hours of the network. For voice devices, capacity problems can include trunk/port

utilization and voice mail messaging disk space. When these resources are overutilized,

service degrades and customer satisfaction suffers. When these resources are

underutilized, it is important to identify where your devices might be overprovisioned so

that you can take some steps to reduce costs or reallocate resources to resolve congestion

in other areas of the network.

Effective capacity planning enables you to achieve the following:

Reduce costs through the reduction or elimination of underused leased lines, as well as

the reduction of maintenance costs for unused or unnecessary ports.

Improve performance through identification of overused and underused ports or trunks

and rebalancing of capacity with demand.

Improve budget predictability by tracking trends, which helps you to avoid emergency

purchases and to ensure that you can research and plan for the best service costs.

To understand traffic patterns, you need to collect information from the PBX that details

peak traffic for each trunk group for at least a few weeks, preferably months. This

information is available on the switch. eHealth for Voice automates the collection of this

information, making it easier to run quarterly and on-demand maintenance assessments of

your voice capacity and usage patterns.

Analyze Voice Capacity

Once you have collected traffic for the desired period, you can use the Capacity Analyzer

tool to determine how well your voice devices are servicing customers during the busiest

hour. From this dialog, you can quickly calculate GoS, view disk space capacity for message

servers, and view process capacity for communications servers.

To access the Capacity Analyzer

1. On the system on which eHealth for Voice is installed, select Start, Programs, eHealth

for Voice, eHealth for Voice. The eHealth for Voice Program Console appears.

2. Select Measurements, Reports in the left navigation console tree.




3. Double-click the Capacity Analyzer icon in the right pane of the console. The Capacity

Analyzer dialog appears.

Analyze GoS

eHealth for Voice calculates a GoS to determine how callers are serviced (calls answered,

busy, or ring no answer) during the busiest hour of the period. This can help you to

determine additional bandwidth needed to carry voice traffic on the network.

To analyze the GoS

1. In the Capacity Analyzer dialog, select the Port Analysis tab to access the grade of

service calculation tool. The target grade of service refers to the percentage of callers that will be serviced (calls answered) during the busiest hour. A GoS of .001 means

that 99.9% of callers will get through.

2. Select the target GoS and click Apply. The dialog shows the number of trunks that you

need to add to support that GoS.




3. Use the horizontal scroll bar to scroll to the right side of the dialog.

4. Review the Add/(Delete) Trunks column to determine the number of trunks that you

will need to provide that GoS. A number in parentheses shows the number of trunks

that you could remove and still be able to support the GoS during the peak hour, which

can help you to detect underutilized resources.

5. Review the Erlangs column to determine the actual peak traffic. An Erlang is a

measurement of voice traffic volume. One Erlang equals 60 minutes of voice traffic

during one hour. If 10 users each make one 10-minute call in a given hour,

the hour had 100 minutes of calls, or 1.67 Erlangs of traffic. This information

helps you to identify how much additional bandwidth you need on the network to

support voice.
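The relationship between Erlangs, GoS, and trunk counts can be sketched with the Erlang B formula, a standard telephony model for blocked calls. This guide does not state which model the Capacity Analyzer uses, so treat the choice of Erlang B and all function names here as assumptions.

```python
def erlangs(call_minutes, period_minutes=60):
    """Traffic intensity: total call minutes divided by the period length."""
    return call_minutes / period_minutes

def erlang_b(traffic, trunks):
    """Blocking probability for offered traffic (Erlangs) on a trunk group."""
    b = 1.0  # with zero trunks every call is blocked
    for n in range(1, trunks + 1):
        b = traffic * b / (n + traffic * b)
    return b

def trunks_needed(traffic, target_gos):
    """Smallest trunk count whose Erlang B blocking meets the target GoS."""
    if not 0 < target_gos < 1:
        raise ValueError("target_gos must be between 0 and 1")
    n = 0
    while erlang_b(traffic, n) > target_gos:
        n += 1
    return n

busy_hour = erlangs(10 * 10)                # ten 10-minute calls, as in step 5
required = trunks_needed(busy_hour, 0.001)  # trunks for a GoS of .001
```

Comparing `required` with the number of trunks already installed gives the kind of surplus or shortfall that the Add/(Delete) Trunks column reports.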

BEST PRACTICES

As you use the Capacity Analyzer, consider the following best practices that can make

results more meaningful for your environment:

When you first install eHealth for Voice, run the Capacity Analyzer weekly to identify

resources that are not being used. After this initial period, you can run it less frequently

(monthly or quarterly) to identify usage changes in your network.

Trunk or port lines with zero or very low traffic could be backups or overflow lines. Before

proceeding with detailed service change plans, always confer with the person responsible

for PBX/IP-PBX engineering to ensure that you understand the purpose of any trunks or

ports.

The level of over- or underutilization varies, depending upon the GoS selected. As the

GoS decreases, the need for additional resources increases. Company service levels will

help define the GoS needed for your environment.




How You Address Underutilized Resources

After you have identified and documented underused resources, consider eliminating

unused trunks, ports, or PBXs to reduce the service charges and/or maintenance charges

for your network.

Show ROI

After you have identified capacity changes that you could make, you can calculate and

show potential monthly savings from eliminated trunks and the difference in maintenance

costs between current and future configurations.

To estimate the ROI

1. Review your monthly usage fees to identify the cost of leased lines that may be

underutilized.

2. Review your port maintenance fees to identify costs for unused ports.

3. Contact your service providers to identify possible costs for changing service or

reducing the number of ports. If you have internal costs for changing service, take

those costs into consideration as well.

4. Calculate the ROI for making changes using the following equation:

ROI = (service change + port-change fees) / monthly savings

5. Based on your ROI calculations, determine whether making the proposed changes is

wise.
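The steps above can be sketched as a short calculation. Note that the equation yields the number of months needed to recover the one-time change fees. The function names and all dollar figures below are hypothetical.

```python
def monthly_savings(trunks_removed, fee_per_trunk,
                    current_maintenance, future_maintenance):
    """Savings from eliminated trunks plus the drop in maintenance costs."""
    return trunks_removed * fee_per_trunk + (current_maintenance - future_maintenance)

def roi_months(service_change_fees, port_change_fees, savings):
    """Apply the equation from step 4: one-time fees divided by monthly savings."""
    if savings <= 0:
        raise ValueError("monthly savings must be positive")
    return (service_change_fees + port_change_fees) / savings

# Hypothetical figures: 4 trunks at $100 per month, maintenance falls from $900 to $700
savings = monthly_savings(4, 100.0, 900.0, 700.0)
months = roi_months(1500.0, 300.0, savings)
```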

How You Address and Confirm Overutilized Resources

The Capacity Analyzer provides the peak traffic for a given timeframe as well as the required number of trunks or ports to handle the traffic load for the desired grade of

service.

To confirm the results of overutilization, do the following:

Run the Capacity Analyzer again and select a more granular date range. For example, if a

quarterly-report peak hour shows utilizations that seem unusually out of range, evaluate

each month to see the pattern or trends of busy-hour data. This can help you to

investigate whether the busy hour is an anomaly, or if the traffic is growing in your

network. If the busy hour is related to a one-time event, you can ignore this atypical

activity in your capacity planning.

Run Voice traffic reports for the platform to show trends in trunk or port usage. In

this way, you can see if there is any overflow to another trunk group.

Verify the GoS selected with the person responsible for PBX/IP-PBX engineering for this

trunk group. Confirm that your analysis for each trunk group uses the GoS originally

intended or planned for that group.

Contact your service provider to add capacity, such as adding trunks to the

hunt group or, for a fractional T1, increasing its capacity.




Chapter 10: Rapid Problem Resolution

When even the smallest problem occurs in a network, a wide range of services and

capabilities can be affected. Network management systems detect these problems and can

often send streams of events to report slowdowns, outages, and impacted services. This

barrage of information, though accurate, often hinders troubleshooting efforts simply

because of the amount of data that operators must filter through.

CA’s Network and Voice Management solution helps you to direct your troubleshooting

efforts to the source of the problem. SPECTRUM software performs event correlation,

impact analysis, and RCA for multiple vendors and technologies across network, system,

voice, and application infrastructures. Combined with eHealth’s ability to find and report on

performance behavior changes, and eHealth for Voice’s ability to monitor the policies and

capacities of voice networks, CA offers a key solution for identifying problems, quickly

targeting the real source of the problem, and providing deeper insights into historical trends

and reports.

This chapter describes how SPECTRUM’s problem resolution and root cause identification

processes work.

Problem-Solving Techniques

SPECTRUM offers three intelligent, automated, and integrated approaches to problem

solving:

Model-based IMT

Rules-based EMS

Policy-based Condition Correlation Technology (CCT)

SPECTRUM is fundamentally a model-based system. Model-based systems are adaptable to

changes that regularly occur in a real-time, on-demand, IT infrastructure. Rules-based

systems are flexible in allowing customers to add their own intelligence without requiring

programming skills. SPECTRUM combines the best of both approaches, using models to

keep up with changes while leveraging easy-to-create rules running against the models to

avoid the need for constant rule editing. Policy-based systems are automated means of

connecting seemingly unrelated pieces of information to determine condition and state of

physical devices and logical services. This condition correlation engine combines with

SPECTRUM’s modeling engine and rules engine to deliver a higher level of cross-silo service

analysis.

You can place almost every service delivery infrastructure problem into one of three

categories: availability, performance, or threshold exceeded. Infrastructure faults occur

when things break, whether they are related to LAN/WAN, server, storage, database,

application or security. Infrastructure performance problems often result in brown-out

conditions in which services are available but are performing poorly. From the user’s

perspective, a slow infrastructure is a broken infrastructure. The final category is abnormal

behavior conditions in which performance, utilization, or capacity thresholds have been

exceeded as demand/load factors fall significantly above or below observed baselines.




SPECTRUM, eHealth, and eHealth for Voice can detect these problems in your network and

raise alarms when problems occur. By sending all alarms to SPECTRUM, you can pinpoint

the cause of problems.

Model-based, rules-based, and policy-based analytics in SPECTRUM understand

relationships between IT infrastructure elements and the customers or business processes

that they are designed to support. It is through this understanding of relationships that SPECTRUM has been shown to deliver a 70% reduction in downtime while resolving 90% of

availability or performance problems from a central location. SPECTRUM’s RCA has been

able to reduce the number of alarms by several orders of magnitude while reducing MTTR

from hours to minutes. SPECTRUM’s distributed management architecture has also proven

effective at performing RCA for over 5 million devices (20+ million ports) in a single

environment with fully meshed and redundant core and distribution network layers. Our

integrated approach to fault and performance management has enabled enterprise,

government, and service provider organizations around the world to manage what matters

through service level intelligence.

Complex Problems and Powerful Solutions

IT infrastructure operations management is a difficult and resource-intensive, yet necessary, undertaking. When the infrastructure fails or slows down, tools are required to

quickly pinpoint the root cause, suppress all symptomatic faults, prioritize based on

business impact, and aid in the troubleshooting and repair process to accelerate service

restoration.

To ensure the performance and availability of the infrastructure, most companies employ a

dual approach of highly available, fault-tolerant, load-balancing designs for infrastructure

devices and communication paths, and a management solution to ensure proper operation.

In fact, the job of the management solution is further complicated by today’s high-

availability environments. The management solution must understand the load-balancing

capacity; it must be able to track primary and fault-tolerant backup paths; and understand

when redundant systems are active. The investment in the management solution is as

important as the investment in the infrastructure itself.

Problem Prediction and Prevention

Management software should help predict or prevent problems. CA’s out-of-the-box

utilization, performance, and response time thresholds can be used to act as an early

warning system when a problem is about to happen or when a service level guarantee is

about to be violated. While these thresholds can obviously be tuned for a specific customer

environment, it is also important to have out-of-the-box thresholds that are relevant from

the start of your monitoring baselines.

Before you can begin the true task of troubleshooting, you must isolate the problem.

Simply being aware of the problem and collecting the data is not sufficient. To effectively

triage the issue, you need to determine the location or source of the problem (and where the problem does not exist). If multiple problems are occurring simultaneously, you should

be able to automatically prioritize issues based on impacted customers, services, or

infrastructure devices. It is far too costly to rely on human intervention to determine the

root cause of problems, and to sift through an unending stream of symptomatic problems.

Every minute that you devote to isolating the problem is a minute lost to solving the

problem.




status, parameter-based threshold violations, response time measurement threshold

violations, deviations from historical performance, and health analysis.

SPECTRUM’s RCA is the automated process of troubleshooting the infrastructure and

identifying the managed elements that have failed to perform their function. The goal of

SPECTRUM’s RCA is straightforward: identify a single source of failure, the Root Cause, and

generate the appropriate actionable alarm for the failed managed element.

Inductive Modeling Technology

The core of SPECTRUM’s RCA solution is its patented IMT. IMT uses a powerful object-

oriented modeling paradigm with model-based reasoning analytics. In SPECTRUM, IMT is

most often used for physical and logical topology analysis as SPECTRUM can automatically

map topological relationships through its auto-discovery engine.

In SPECTRUM, a “model” is the software representation of a real-world managed element,

or a component of that managed element. This representation allows SPECTRUM to not only

investigate and query an individual element within the network, but also provides the

means to establish relationships between elements to recognize them as part of a larger

system. IMT’s RCA is based on a sophisticated system of models, relationships and

behaviors that create a software representation of the infrastructure. Decisions concerning

which element is at fault are not determined by looking at a single element alone. Instead,

the relationship between the elements is understood and the conditions of related managed

elements are factored into the analysis. Models are in direct communication with their real-

world counterparts, enabling SPECTRUM to not only listen, but proactively query for health

status or additional diagnostic information. Models are described by their attributes,

behaviors, relationships to other models, and algorithmic intelligence.

Intelligent analysis is enabled through the collaboration of models in a system. This

collaboration enables correlation of the symptoms, suppression of unnecessary alarms, and

impact analysis of affected users, customers, and services. Collaboration includes the ability

to exchange information and initiate processing between any models within the modeling

system. A model that is making a request to another model may, in turn, trigger that model to make requests to other models, and so on. Relationships between models provide a

context for collaboration.

Collaboration between models enables the following:

Correlation of the symptoms

Suppression of unnecessary/symptomatic alarms

Impact analysis

A simple example of IMT in action can be demonstrated by a network router port transition

from UP to DOWN. If a port model receives a LINK DOWN trap, it has intelligence to react

by performing a status query to determine if the port is actually down. If it is, in fact,

DOWN, it consults the system of models to determine if the port has lower layer sub-

interfaces. If any of the lower layer sub-interfaces are also DOWN, only the condition of the

lower layer port will be raised as an alarm. An application of this example can be described

by several Frame Relay DLCIs transitioning to INACTIVE. If the Frame Relay port is DOWN,

IMT will suppress the symptomatic DLCI INACTIVE conditions and raise an alarm on the

Frame Relay port model. Additionally, when the port transitions to DOWN, IMT will query

the status of the connected Network Elements (NEs) and if those are also DOWN, those

conditions will be considered symptomatic of the port DOWN, will be suppressed, and will




be identified as impacted by the port DOWN alarm. Root cause and impact are determined

through IMT’s ability to both listen and talk to the infrastructure.
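The port DOWN reasoning above can be sketched in Python. This is only an illustration under stated assumptions, not SPECTRUM's implementation: the PortModel class, its fields, and both functions are hypothetical names, and the status query is simulated by a stored flag.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PortModel:
    """Hypothetical stand-in for a port model; not a SPECTRUM class."""
    name: str
    is_down: bool = False  # what a status query would report
    lower_layers: List["PortModel"] = field(default_factory=list)


def find_alarm_target(port):
    """Return the model that should carry the alarm, or None if the port is up."""
    if not port.is_down:  # the status query contradicts the trap
        return None
    for lower in port.lower_layers:
        target = find_alarm_target(lower)
        if target is not None:  # a lower layer is down, so it is the real fault
            return target
    return port


def correlate(ports):
    """Map each alarmed model name to the names of its suppressed symptoms."""
    alarms = {}
    for port in ports:
        target = find_alarm_target(port)
        if target is None:
            continue
        symptoms = alarms.setdefault(target.name, [])
        if port is not target:
            symptoms.append(port.name)
    return alarms


# Three DLCIs go INACTIVE because the Frame Relay port beneath them is DOWN.
fr_port = PortModel("Serial0/0", is_down=True)
dlcis = [PortModel(f"DLCI-{n}", is_down=True, lower_layers=[fr_port])
         for n in (100, 101, 102)]
alarms = correlate(dlcis)
```

In this sketch, a single alarm is recorded against the Frame Relay port, and the three DLCI conditions are listed as its symptoms.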

Event Management System

At times, event streams local to a specific source are the only source of management

information. Any one event may or may not be a significant occurrence — but in the

context of other events, information, or time, it may be an actionable condition. Event

Rules in SPECTRUM’s Event Management System provide a more complex decision-making

system to indicate how events should be processed. You can apply Event Rules to look for a

series of events to occur on a model in a certain pattern, within a specific timeframe, or

with certain data value ranges. You can use Event Rules to generate other events or even

alarms.

If events occur that meet the preconditions of a rule, SPECTRUM may do the following:

Generate another event, allowing cascading events.

Log the event for later reporting/troubleshooting purposes.

Promote the event into an actionable alarm.

SPECTRUM provides six customizable Event Rule types that form the basis of the Event

Management System rules-based engine.

These rule types are building blocks that can be used individually or cooperatively to effect

an alarm on the most simple or sophisticated event-oriented scenarios. This Event

Management System rules engine allows for the correlation of event frequency/duration,

event sequence and event coincidence.

The Event Rule types are as follows:

Event Pair (Event Coincidence): This rule type generates a new event when two events

that you define do not occur in sequence. If the second event in a series does not

occur, this may indicate a problem. The Event Pair rule type creates a more relevant event based on this scenario. Event rules based on the Event Pair rule type generate a

new event when an event occurs without its paired event. It is possible for other events

to occur between the specified event pair without affecting this event rule.

Event Rate Counter (Event Frequency): This rule type generates a new event based

on events that occur at a specified rate in a specified time span. A few events of a certain

type might not be a problem, but if the number of these events reaches a certain

threshold within a specified time period, notification is required. SPECTRUM does not

generate additional events if the rate stays at or above the threshold. If the rate drops

below the threshold and then subsequently rises above the threshold, it generates

another event. The Event Rate Counter type is best suited for detecting a long, sustained

burst of events.

Event Rate Window (Event Frequency): This rule type generates a new event when a

number of the same events are generated in a specified time period. The Event Rate

Window type is best suited for accurately detecting shorter bursts of events. It monitors

an event that is not significant if it occurs occasionally, but is significant if it happens

frequently within a short period of time. If an event occurs a few times during the day, a

problem may not exist. If an event occurs five times in one minute, perhaps that is a

condition for which you want to be notified. If the event occurs above a certain rate,

SPECTRUM generates another event. SPECTRUM will not generate additional events if the




rate stays at or above the threshold. If the rate drops below the threshold and then

subsequently rises above the threshold, it generates another event.

Event Sequence (Event Sequence): This rule type generates an event when a

particular order of sequenced events might be significant in your environment. This

sequence can include any number and any type of events. When the sequence is detected

in the given period of time, SPECTRUM generates a new event.

Event Combo (Event Coincidence): This rule type generates a new event when a

certain combination of events occurs in any order. The combination can include any

number and type of events. When the combination is detected within a given time period,

SPECTRUM generates a new event.

Event Condition (Event Coincidence): This rule type generates an event based on a

conditional expression. As part of SPECTRUM’s “trust but verify” methodology, a series of

conditional expressions can be listed within the event rule and the first expression that is

found to be TRUE will generate the event specified with the condition. You can construct

rules to provide correlation through a combination of evaluating event data with IMT

model data (including attributes which can be read directly from the remote managed

element). For example, if a trap is received notifying the management system of memory

buffer overload, to validate that an alarm condition has occurred, an Event Condition rule can initiate a request to the device to check actual memory utilization.
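As an illustration of one of these rule types, the following Python sketch shows the kind of check an Event Combo rule performs: a set of events that must all occur, in any order, within a time window. The function name, the tuple layout, and the event names are assumptions made for this example and are not SPECTRUM syntax.

```python
def combo_matches(events, required, window_seconds):
    """events: list of (timestamp_seconds, event_type) sorted by time.
    Return True if every type in `required` occurs, in any order,
    within a single span of `window_seconds`."""
    required = set(required)
    for i, (start, _) in enumerate(events):
        seen = set()
        for ts, etype in events[i:]:
            if ts - start > window_seconds:
                break  # this window is exhausted; try the next start point
            if etype in required:
                seen.add(etype)
            if seen == required:
                return True
    return False


# Hypothetical event names: a fan failure and a high temperature 30 seconds apart.
close_together = [(0, "FAN_FAIL"), (30, "TEMP_HIGH")]
far_apart = [(0, "FAN_FAIL"), (400, "TEMP_HIGH")]
hit = combo_matches(close_together, {"TEMP_HIGH", "FAN_FAIL"}, 60)
miss = combo_matches(far_apart, {"TEMP_HIGH", "FAN_FAIL"}, 60)
```

Here `hit` is True because both events fall inside one 60-second window, while `miss` is False.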

SPECTRUM implements a number of event rules out-of-box by applying one or more of the

event rule types to event streams. You can create or customize event rules using any of the

rule types and apply these Event Rules on other event streams. Further implementation of

event rules using the Event Management System is discussed later in this paper.

Condition Correlation

To perform more complex user-defined or user-controlled correlations, SPECTRUM offers a

policy-based CCT that enables the following:

Creation of correlation policies

Creation of correlation domains

Correlation of seemingly disparate event streams or conditions

Correlation across sets of managed elements

Correlation within managed domains

Correlation across sets of managed domains

Correlation of component conditions as they map to higher order concepts such as

business services or customer access

Several important concepts relate to condition correlation:

Conditions: A condition is similar to state. An event/action can set a condition and it can

clear it. It is also possible to have an event set a condition but require a user-based

action to clear the condition. The condition exists from the time it is set until the time it is

cleared. A very simple example of a condition is a “port down” condition. The “port down”

condition will exist for a particular interface from the time that the LINK DOWN trap or set

event (such as a failed status poll) is received until the time the LINK UP trap or clear

event (such as a successful status poll) is received. A number of conditions that may be

useful for establishing domain level correlations are defined out-of-box in SPECTRUM, and

you can add more.




Seemingly Disparate Conditions: Many devices in an IT infrastructure provide a

specific function. The device-level function is often without context as it relates to the

functions of other devices/components. Most managed elements can emit event streams,

but those event streams are local to each component. A simple example is when a

Response Time Management system identifies a condition of a test result exceeding a

threshold. At the same time, an Element Management System may identify a condition of

a router port exceeding a transmit bandwidth threshold. These conditions are seemingly disparate as they are created independently and without context or knowledge of each

other. In reality, the two are often closely related; that is, an overutilized port could be

the cause of the response degradation.

Rule Patterns: Rule Patterns associate conditions when specific criteria are met. A

simple example is a “port down” condition caused by a “board pulled” condition. The two

conditions are likely related if the port and board have the same slot number. The

following diagram illustrates this rule pattern. A rule pattern can result in the creation of

an actionable alarm or the suppression of symptomatic alarms.

Correlation Domains: You can use a Correlation Domain to both define and limit the

scope of one or more Correlation Policies. You can apply it to a specific Service. For

example, in the Cable Broadband environment, a return path monitoring system may

detect a return path failure in a certain geographic service area. This “return path failure”

condition is causing subscribers’ high-speed cable modems to become unreachable and

Video on Demand (VoD) pay-per-view streams to fail. The knowledge that the return path

failure, the modem problems, and the failed video streams are all in the same correlation

domain is essential to correlating the events and ultimately identifying the root cause.

However, it is also important to have the ability to distinguish that a “return path failure”

condition occurring in one correlation domain (Philadelphia) should not be correlated with

VoD stream failure conditions occurring in a different correlation domain (New York).

Correlation Policies: You can bundle multiple Rule Patterns into Correlation Policies. You

can then apply Correlation Policies to a Service or Correlation Domain. For example, you

can create a bundle of rule patterns applicable to OSPF and label them the OSPF

Correlation Policy. You can apply the OSPF Correlation Policy to each Correlation Domain,

where each autonomous OSPF region and the supporting routers in that region define the

Correlation Domain. As another example, you could define a Correlation Policy based on a

set of rule patterns that operate within the confines of an MPLS/BGP VPN, labeled as the




Intra-VPN Policy, and apply it to all modeled VPNs. Whenever you add a rule to a

Correlation Policy, or delete one from it, SPECTRUM automatically updates all related

Correlation Domains immediately. You can apply multiple Correlation Policies to any

Correlation Domain, and apply a Correlation Policy to many Correlation Domains.

Condition-based correlations are very powerful and provide a mechanism to develop

Correlation Policies and apply them to Correlation Domains. When you apply them to

Service Level Management, Correlation Policies are similar to metrics of an SLA, and

Correlation Domains are similar to service, customer, or geographical groupings.

Occasionally, the only way to infer a causal relationship between two or more seemingly

disparate conditions is when those conditions occur in a common Correlation Domain. These

mechanisms are necessary when SPECTRUM cannot discover causal relationships

through interrogations.
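The relationship between conditions, rule patterns, and correlation domains can be sketched as follows. This is a minimal Python illustration, not the CCT policy format: the Condition class, the function names, and the device names are hypothetical. The rule pattern compares only the slot number, as in the “board pulled” example above, and the correlation domain limits which conditions are compared at all.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Condition:
    """Hypothetical record of a set condition; not a SPECTRUM type."""
    name: str    # for example "port down"
    device: str
    slot: int
    domain: str  # correlation domain that the device belongs to


def port_down_caused_by_board_pulled(symptom, cause):
    """Rule pattern: a port down is caused by a board pulled in the same slot.
    A production pattern would presumably also compare the device."""
    return (symptom.name == "port down"
            and cause.name == "board pulled"
            and symptom.slot == cause.slot)


def correlate(conditions, rule_patterns):
    """Return (symptom, cause) pairs, comparing only within one domain."""
    pairs = []
    for symptom in conditions:
        for cause in conditions:
            if symptom is cause or symptom.domain != cause.domain:
                continue
            if any(rule(symptom, cause) for rule in rule_patterns):
                pairs.append((symptom, cause))
    return pairs


conditions = [
    Condition("board pulled", "rtr-phl-1", 3, "Philadelphia"),
    Condition("port down", "rtr-phl-1", 3, "Philadelphia"),
    Condition("port down", "rtr-nyc-1", 3, "New York"),
]
pairs = correlate(conditions, [port_down_caused_by_board_pulled])
```

Only the Philadelphia port is tied to the pulled board. The New York port has the same slot number, but it sits in a different correlation domain and is therefore never compared.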

Fault Scenarios

Out-of-box, SPECTRUM addresses a wide range of different scenarios for which it can

perform RCA. This section provides specific scenarios where the techniques described in the

previous section are employed to determine RCA and impact analysis. For the sake of

simplicity and brevity, the detail will be limited to the basic processing. Also, for the

purpose of the discussion and figures, the following table shows the color of alarms that are

associated with the icon status of SPECTRUM models at any given time.

Communication Outages and Impacts

Communication outages are types of faults often described as “black-outs” or “hard faults.”

With these types of faults, one or more communication paths are degraded to the point that

traffic can no longer pass. The fault could be caused by many situations including broken

copper/fiber cables/connections, improperly configured routers/switches, hardware failures,

severe performance problems, security attacks, and so on. With these hard communication

failures, limited information is available to the management system as it is unable to

exchange information with one or more managed elements. With SPECTRUM’s sophisticated

system of models, relationships, and behaviors available through IMT, SPECTRUM can infer

the fault and impact. IMT inference algorithms are also called Inference Handlers. A set of

Inference Handlers designed for a purpose is referred to as an Intelligence Circuit, or simply

Intelligence.




How SPECTRUM’s Intelligence Isolates Communication Outages

SPECTRUM offers powerful capabilities that can help you to identify the real sources of

problems in the network. For many management solutions, the steps to achieve this

capability are often manual and very time-intensive. With SPECTRUM, however, many of

these steps are performed automatically by the SPECTRUM software.

The process that SPECTRUM uses to identify and isolate outages is as follows:

1. Use SPECTRUM discovery to build a model of your infrastructure that shows the

resources in your network and how they are connected.

2. Upon receipt of a problem event, SPECTRUM checks the status of closely-connected

resources to determine whether they have problems.

3. SPECTRUM analyzes the status of the resources to identify the likely root cause of the

problem.

4. SPECTRUM suppresses alarms that are symptoms of the root cause, but not the cause

itself.

5. SPECTRUM evaluates the severity of the problem to help prioritize the problem among

any other reported problems in the network.

The following sections describe these SPECTRUM capabilities in more detail.

BUILD THE MODEL WITH AUTODISCOVERY

An accurate representation of the infrastructure is critical for determining the fault and the

impact of the fault. SPECTRUM’s modeling system can represent not only a wide array of

multi-vendor equipment, but also a wide range of technologies and connections that can

exist between various infrastructure elements. SPECTRUM has specific solutions for

discovering multi-path networks over a variety of technologies supporting many different

architectures. SPECTRUM offers support for meshed and redundant, physical and logical topologies based on ATM, Ethernet, Frame Relay, HSRP, ISDN, ISL, MPLS, Multicast, PPP,

VoIP, VPN, VLAN and 802.11 Wireless environments — even legacy technologies such as

FDDI and Token Ring. SPECTRUM’s modeling is extremely extensible and can be used to

model OSI Layers 1-7 in a communication infrastructure.

SPECTRUM provides four different methods for building the physical and logical topology

connectivity model for any given infrastructure:

SPECTRUM’s AutoDiscovery application automatically and dynamically interrogates the

managed infrastructure about its physical and logical relationships. This approach to

AutoDiscovery was patented in 1996, and SPECTRUM was the industry’s first product to

discover Layer 2 switch connectivity. SPECTRUM’s AutoDiscovery application works in two

distinct phases (although there are many different stages within each phase that are not covered here). The first phase is Discovery. When initiated (as described in Chapter 5),

AutoDiscovery automatically discovers the elements that exist in the infrastructure. This

provides SPECTRUM with an inventory of elements that could be managed. The second

phase is Modeling. AutoDiscovery uses management and discovery protocols to query the

elements it has found to gain information that will be used to determine the Layer 2 and

Layer 3 connectivity between managed elements. For example, AutoDiscovery uses SNMP

to examine route tables, bridge tables, and interface tables, but also uses traffic analysis




and vendor proprietary discovery protocols such as Cisco’s CDP. AutoDiscovery is a very

thorough and automated mechanism for building the infrastructure model.

The Modeling Gateway imports a description of the entire infrastructure’s components, as

well as physical and logical connectivity information from external sources, such as

Provisioning systems or Network Topology databases.

You can use the command line interface or programmatic APIs to build a custom integration or application that imports information from external sources.

Graphical user interfaces allow users to quickly point, click, and drag and drop to

manually build the model.

SPECTRUM’s modeling scheme allows a single managed element to be logically divided into

any number of sub-models. This collection of models and the relationships between them is

often referred to as the semantic data model for that type of managed element. Thus, a

typical semantic data model for a networking device may include a chassis model with

board models related to the chassis. Physical interface models would be associated to the

board models. Each physical interface model may have a set of subinterface models

associated below them.
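The chassis, board, interface, and subinterface hierarchy described above can be pictured as a simple tree. The following Python sketch is illustrative only; the Model class and its methods are hypothetical and do not reflect SPECTRUM's internal model types.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Model:
    """Hypothetical node in a semantic data model."""
    kind: str  # "chassis", "board", "interface", or "subinterface"
    name: str
    children: List["Model"] = field(default_factory=list)

    def add(self, child):
        """Relate a sub-model to this model and return the sub-model."""
        self.children.append(child)
        return child

    def walk(self, depth=0):
        """Yield (depth, model) for this model and everything below it."""
        yield depth, self
        for child in self.children:
            yield from child.walk(depth + 1)


chassis = Model("chassis", "router-1")
board = chassis.add(Model("board", "slot-1"))
port = board.add(Model("interface", "Serial1/0"))
port.add(Model("subinterface", "Serial1/0.100"))
outline = [(depth, m.kind) for depth, m in chassis.walk()]
```

Walking the tree from the chassis visits each level in turn, which is the kind of traversal that lets a fault on a board be related to the interfaces beneath it.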

SPECTRUM has a set of well-defined associations that define how different semantic data model sets act with one another. When SPECTRUM represents the connectivity between two

devices, a relationship is established not only between the two ports that form the link

between them, but also between device models and to the corresponding interface and port

models of other devices, as shown in the following figure.

START THE PROBLEM ANALYSIS

SPECTRUM can begin to solve a problem proactively upon receipt of a single symptom.

Many problems share the same set of symptoms, but SPECTRUM must perform further

analysis to determine the root cause. For communication outages, the analysis begins when

a model in SPECTRUM recognizes the communications failures through failed polling, traps,

events, performance threshold violations, or lack of response. SPECTRUM automatically

validates the communication failures through retries, alternative protocols, and alternative

path checking as part of its “trust but verify” methodology. The model that raised the




problem which started the intelligence is called the initiator, although more than one model

can trigger the intelligence.

The initiator model intelligence requests a list of other models that are directly connected to

it. These connected models are referred to as the initiator model’s neighbors. For example,

the following figure shows five models, where Model B is the initiator, and models A, C, D,

and E are neighbors.

With a list of neighbors identified, the intelligence directs each neighbor model to check its

current status. This check is referred to as the “Are You OK?” check. “OK” is a relative term

and a unique set of attributes related to performance and availability will vary from model

to model based on the real-world capabilities of the device that the model is representing.

When a model is asked “Are You OK?”, the model can initiate a variety of tests/checks to

verify its current operational status. For example, with most SNMP-managed elements, the

check is typically a combination of SNMP requests but could be more involved by

interrogating an Element Management System or as simple as an ICMP Ping. A

comprehensive check could include threshold performance calculations or execution of response time tests.

Each neighbor model returns an answer to “Are You OK?”.
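A rough Python sketch of the “Are You OK?” check follows. It assumes that each model has an ordered list of status checks (stand-ins for SNMP requests, an ICMP ping, and so on), that each check is retried a fixed number of times, and that the model counts as OK as soon as any check succeeds. The function names and that pass/fail policy are assumptions for illustration; the source does not specify them.

```python
def are_you_ok(checks, retries=2):
    """checks: list of zero-argument callables that return True or False.
    Each check is tried up to retries + 1 times before the next one is used."""
    for check in checks:
        for _ in range(retries + 1):
            if check():
                return True
    return False


def poll_neighbors(neighbors):
    """neighbors maps a model name to its list of checks.
    Return a mapping of model name to its answer."""
    return {name: are_you_ok(checks) for name, checks in neighbors.items()}


# Simulated checks: Model A answers; Model C fails both of its checks.
neighbors = {
    "A": [lambda: True],
    "C": [lambda: False, lambda: False],
}
answers = poll_neighbors(neighbors)
```

The resulting `answers` mapping is the input that the fault isolation step works from.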




LOCATE THE ROOT CAUSE: FAULT ISOLATION

If the initiator model has a neighbor that responds that it is “OK”, such as Model A in the

previous figure, SPECTRUM can infer that the problem lies between the unaffected neighbor

and the affected initiator (Model B). In this case, the initiator model that triggered the

intelligence is a likely culprit for this particular infrastructure failure. As a result, SPECTRUM

raises a critical alarm on the initiator model, which is considered the “Root Cause” alarm, as

shown in the next figure.

HIDE THE NOISE OF SYMPTOMATIC PROBLEMS WITH ALARM SUPPRESSION

As the analysis continues beyond isolating the device at fault (Model B), the next step is to

analyze and suppress reporting of the effects of the fault. This is the goal of intelligent

Alarm Suppression. If a neighbor (such as Models C, D, or E) of the initiator model

responds that it is not OK, this neighbor is considered to be affected by the failure occurring

elsewhere in the infrastructure. As a result, SPECTRUM places these models into a

suppressed condition (Grey Color) because the alarms are symptomatic of a problem elsewhere. While these resources are experiencing problems, they are not the root cause

problem; they will likely be fixed when operators have addressed the problems that are

affecting Model B.
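Fault isolation and alarm suppression together can be sketched as a single decision over the neighbors' answers. The Python below is a simplified illustration with hypothetical names. The branch for an initiator with no healthy neighbor is an assumption added for completeness; the text above describes only the case where at least one neighbor is OK.

```python
def isolate_fault(initiator, neighbor_ok):
    """neighbor_ok maps each neighbor name to its "Are You OK?" answer.
    Return a mapping of model name to alarm state."""
    states = {}
    if any(neighbor_ok.values()):
        # A healthy neighbor means the fault lies at the initiator.
        states[initiator] = "critical"
    else:
        # Assumption: with no healthy neighbor, the initiator is itself
        # treated as a symptom of a fault elsewhere.
        states[initiator] = "suppressed"
    for name, ok in neighbor_ok.items():
        states[name] = "normal" if ok else "suppressed"
    return states


states = isolate_fault("B", {"A": True, "C": False, "D": False, "E": False})
```

With Model A healthy, Model B receives the critical root cause alarm and Models C, D, and E are suppressed.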




PRIORITIZE THE PROBLEM: IMPACT ANALYSIS

SPECTRUM continues to analyze the total impact of the fault because of its ability to

understand that the individual models exist as part of a larger network of models

representing the managed infrastructure.

As such, the intelligence will analyze each Fault Domain, which is the collection of models

with suppressed alarms related to the same failure. These impacted models are linked to the root fault for presentation and analysis. The intelligence provides a measurement of the

impact that this fault is having by examining the models that are included within this Fault

Domain and calculating a measurement that serves as the impact severity. The impact

severity value provides a ranking system so that operators can quickly assess the relative

impact of each particular infrastructure fault in order to prioritize their corrective actions.
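The source does not give the formula that SPECTRUM uses for impact severity. Purely as an illustration of ranking faults by the contents of their Fault Domains, the following Python sketch sums a weight for each suppressed model. The weights, model kinds, and function names are all hypothetical.

```python
# Hypothetical weights per model kind; not SPECTRUM values.
WEIGHTS = {"router": 10, "switch": 5, "server": 3, "workstation": 1}


def impact_severity(fault_domain):
    """fault_domain: list of model kinds whose alarms are suppressed
    under one root fault. Unknown kinds count as 1."""
    return sum(WEIGHTS.get(kind, 1) for kind in fault_domain)


def rank_faults(fault_domains):
    """fault_domains maps a root cause model to its fault domain.
    Return the root cause models, highest impact first."""
    return sorted(fault_domains,
                  key=lambda root: impact_severity(fault_domains[root]),
                  reverse=True)


fault_domains = {
    "X": ["workstation", "workstation"],
    "B": ["router", "switch", "server"],
}
ranking = rank_faults(fault_domains)
```

In this example the fault on Model B ranks ahead of the fault on Model X, so operators would address it first.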

Event Management System

Event Rules provide even more processing and correlation of event streams. Event Rule

processing is required for situations in which the event stream is the only source of

management information. For example, SPECTRUM’s Southbound Gateway enables

SPECTRUM to accept event streams from devices and applications not directly monitored by

SPECTRUM, such as the eHealth for Voice PBX devices and message servers. You can also apply event rules to perform intelligent processing of events within certain contexts:

frequency, sequence, and combination. As described earlier in this chapter, you can apply six

event rule types as event rules:

Event Pair: Expected pair event or missing pair event in specified time span.

Event Rate Counter: Events at specified rate in specified time span.

Event Rate Window: Number of events in specified time span.

Event Sequence: Ordered sequence of events in specified time span.

Event Combo: Two or more events, any order in specified time span.

Event Condition: Events parsed for specific data to allow creation of new events based on comparisons of variable bindings, attributes, constants, etc.

SPECTRUM provides many out-of-the-box event rules, but also provides easy-to-use

methods for creating new rules using one or more of the event rule types. This section

highlights a couple of out-of-box event rules and also a few customer examples of event

rule applications.

OUT-OF-BOX EVENT PAIR RULE

SPECTRUM has the ability to interpret Cisco syslog messages as event streams. Each syslog

message is generated on behalf of a managed switch or router and is directed to the

SPECTRUM model representing that managed element. One of the many Cisco syslog

messages indicates a new configuration has been loaded into the router. The Reload

message should always be followed by a Restart message, indicating the device has been restarted to adopt the newly loaded configuration. If not, a failure during reload is probable.

SPECTRUM uses an event rule based on the Event Pair rule type to raise an alarm with

cause “ERROR DURING ROUTER RELOAD” if it does not receive the Restart message within

15 minutes of the Reload message. The following diagram illustrates the events and timing.
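The timing logic of this Event Pair rule can be sketched in Python as follows. The function name, the tuple layout, and the RELOAD and RESTART labels are assumptions for this example; they are not SPECTRUM event codes.

```python
def missing_pair_alarms(events, now, first="RELOAD", second="RESTART",
                        timeout=15 * 60):
    """events: list of (timestamp_seconds, device, event_type) sorted by time.
    Return the devices whose `first` event was not followed by `second`
    within `timeout` seconds, evaluated at time `now`."""
    pending = {}  # device -> time of the unanswered first event
    for ts, device, etype in events:
        if etype == first:
            pending[device] = ts
        elif etype == second and device in pending:
            if ts - pending[device] <= timeout:
                del pending[device]  # the pair completed in time
        # Any other event between the pair is ignored.
    return [device for device, started in pending.items()
            if now - started > timeout]


events = [
    (0, "r1", "RELOAD"),
    (0, "r2", "RELOAD"),
    (60, "r1", "LINK_UP"),    # unrelated event between the pair
    (120, "r1", "RESTART"),
]
alarms = missing_pair_alarms(events, now=1000)
```

Router r1 restarted two minutes after its reload, so only r2 is reported once the 15-minute window has passed.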




MANAGING SECURITY EVENTS USING AN EVENT RATE COUNTER RULE

SPECTRUM is able to collect event feeds from many sources. Some customers send events

from security devices such as Intrusion Detection Systems (IDSs) and firewalls. These

types of devices can generate millions of log file entries. These customers could use an

Event Rate Counter rule to distinguish between sporadic client connection rejections and

real security attacks. The rule generates a critical alarm if 20 or more connection failures

occur in less than one minute, as shown in the following figure.
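A simple way to picture this rule is a sliding window over event timestamps. The Python sketch below raises one alarm when 20 events fall within 60 seconds and does not raise another until the rate has dropped below the threshold and risen again, matching the behavior described earlier for rate-based rules. The sliding-window counting method and the function name are assumptions; the source does not describe SPECTRUM's exact algorithm.

```python
from collections import deque


def rate_alarms(timestamps, threshold=20, window=60.0):
    """timestamps: event times in seconds, sorted ascending.
    Return the times at which the rate first reaches `threshold` events
    inside `window` seconds after having been below it."""
    recent = deque()
    alarms = []
    above = False
    for ts in timestamps:
        recent.append(ts)
        while ts - recent[0] >= window:
            recent.popleft()  # drop events older than the window
        if len(recent) >= threshold:
            if not above:
                alarms.append(ts)  # first crossing only
            above = True
        else:
            above = False
    return alarms


burst = [float(i) for i in range(25)]            # 25 failures, one per second
sporadic = [float(i * 30) for i in range(25)]    # one failure every 30 seconds
two_bursts = burst + [1000.0 + i for i in range(20)]
```

A one-per-second burst triggers a single alarm at the twentieth event, sporadic rejections never trigger, and a second burst after a quiet period triggers again.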

MANAGING SERVER MEMORY GROWTH USING AN EVENT SEQUENCE RULE

A common problem with some applications is the inability to manage memory usage. Some

applications will use system memory and never free it again for other applications to use.

This can degrade performance on the host machine, and eventually the “memory leaking”

application will fail. As one example, if you have a web server application with a history of

slow memory leak problems, you might schedule a reboot once a week during a planned

maintenance window to compensate for the memory consumption. However, if the memory

leak occurs more quickly than usual, which is a deviation from normal behavior, you might

want to perform an emergency reboot before the scheduled maintenance. You can employ




a combination of progressive SPECTRUM thresholds with an Event Sequence rule to monitor

for abnormal behavior, or you could use eHealth Live Health analysis to report the deviation

from normal memory consumption. Using the SPECTRUM thresholds as an example, you

could set monitoring to create events as the memory usage passed threshold points of

50%, 75% and 90%. If those threshold points are reached in a period of less than one

week, SPECTRUM generates an alarm to provide notification to reboot the server prior to

the scheduled maintenance window, as shown in the following diagram.
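The sequence check in this example can be sketched in Python. The event names MEM_50, MEM_75, and MEM_90 stand for the three threshold events and are hypothetical, as is the function itself.

```python
WEEK = 7 * 24 * 3600  # seconds


def sequence_detected(events, sequence=("MEM_50", "MEM_75", "MEM_90"),
                      window=WEEK):
    """events: list of (timestamp_seconds, event_type) sorted by time.
    Return True if `sequence` occurs in order in less than `window` seconds."""
    for i, (start, etype) in enumerate(events):
        if etype != sequence[0]:
            continue
        position = 1
        if position == len(sequence):
            return True
        for ts, later in events[i + 1:]:
            if ts - start >= window:
                break  # too slow; try a later starting event
            if later == sequence[position]:
                position += 1
                if position == len(sequence):
                    return True
    return False


DAY = 24 * 3600
fast_leak = [(0, "MEM_50"), (2 * DAY, "MEM_75"), (4 * DAY, "MEM_90")]
slow_leak = [(0, "MEM_50"), (5 * DAY, "MEM_75"), (9 * DAY, "MEM_90")]
```

The fast leak crosses all three thresholds in four days and would raise the alarm. The slow leak takes nine days, so the weekly reboot would already have reset it.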

AN OUT-OF-BOX EVENT CONDITION RULE COMBINED WITH AN EVENT PAIR RULE

RFC2668 (MIB for IEEE 802.3 Medium Attachment Units) provides management definitions

for Ethernet hubs. Within the RFC is the definition of an SNMP trap used to notify a

management system when the “jabber state” of an interface changes. Jabber occurs when

a device that is experiencing circuitry or logic failure continuously sends random (garbage)

data. The trap identifier simply indicates a change in condition and the variable data portion

of the trap indicates whether “jabbering” has started or stopped. SPECTRUM applies an Event Condition rule to create distinct start/stop events by looking at the variable portion of

the trap, and uses an Event Pair rule to create an alarm if the “jabbering start” is not

closely followed by a “jabbering stop” event.
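The two-step processing described above can be sketched in Python. The first function plays the role of the Event Condition rule, returning the event tied to the first expression that evaluates to true. The second plays the role of the Event Pair rule. The numeric state values are assumed to follow the jabber state enumeration in the MAU MIB and should be checked against the RFC before use. The event names and the 30-second timeout are hypothetical.

```python
NO_JABBER, JABBERING = 3, 4  # assumed jabber state values; verify against the RFC


def classify_jabber_trap(jabber_state):
    """Event Condition step: the first true expression names the new event."""
    conditions = [
        (lambda s: s == JABBERING, "JABBER_START"),
        (lambda s: s == NO_JABBER, "JABBER_STOP"),
    ]
    for predicate, event in conditions:
        if predicate(jabber_state):
            return event
    return None


def jabber_alarm(traps, now, timeout=30):
    """traps: list of (timestamp_seconds, jabber_state) for one interface,
    sorted by time. Event Pair step: return True if a start was not
    followed by a stop within `timeout` seconds, evaluated at time `now`."""
    started = None
    for ts, state in traps:
        event = classify_jabber_trap(state)
        if event == "JABBER_START" and started is None:
            started = ts
        elif event == "JABBER_STOP" and started is not None:
            if ts - started > timeout:
                return True  # the stop arrived too late
            started = None
    return started is not None and now - started > timeout
```

For example, `jabber_alarm([(0, 4), (5, 3)], now=100)` reports no alarm because the stop follows the start within five seconds, while `jabber_alarm([(0, 4)], now=100)` reports an alarm.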

CONDITION CORRELATION TECHNOLOGY

The CA SPECTRUM CCT offers advanced customization capabilities for defining event

relationships to isolate root causes of problems. For example, consider the complexities of

managing an IP network that provides VPN connectivity across an MPLS backbone with

intra-area routing maintained by Intermediate System-to-Intermediate System (IS-IS) and

inter-area routing maintained by BGP. Any physical link or protocol failure could cause

dozens of events from multiple devices. Without carefully applied, sophisticated correlation,

the network troubleshooters could spend most of their time chasing after symptoms, rather than fixing the root cause.




AN IS-IS ROUTING FAILURE EXAMPLE

The following example illustrates the range of capabilities for Condition Correlation. A core

router, labeled in the figure as R1, has lost IS-IS adjacencies to all neighbors (labeled in

the figure as R2, R3, and R4). This also causes the BGP session with the route reflector

(labeled in the figure as RR) to be lost. This condition, if it persists, will result in routes

aging out of R1 and adjacent edge routers R3 and R4. Eventually, the customer VPN sites

serviced by these edge routers will be unable to reach their peer sites (labeled in the figure

as CPE1, CPE2, CPE3).

This failure causes the routers to generate a series of syslog error messages and traps. The

following table shows the messages and traps that SPECTRUM would receive:

The root cause of all these messages is the IS-IS routing problem related to R1. For many

management systems, the operator or troubleshooter would see each of these messages

and traps as seemingly disparate events on the event/alarm console. A trained operator or

experienced troubleshooter may be able to deduce, after some careful thought, that an R1

routing problem has occurred. However, in a large environment, these events/alarms will

likely be interspersed with other events/alarms cluttering the console. Even if the operator

or troubleshooter had the experience to identify the correlation manually, effort and time




would be devoted to doing so. That time is directly related to costs, lower user satisfaction,

and lost revenue.

Without condition correlation, SPECTRUM would send the alarm console users notification of

ten or more events. However, using a combination of an Event Rule and Condition

Correlation, you can apply a set of rule patterns to a Correlation Domain consisting of all

core (LSR) routers, enabling SPECTRUM to produce a single actionable alarm. This alarm will indicate that R1 has an IS-IS routing problem, and a network outage may result if this

is not corrected. The seemingly disparate conditions that SPECTRUM correlates to produce

this alarm appear in the “symptoms” panel of the alarm console as follows:

1. A local Event Rate Counter rule was used to define multiple ‘IS-IS adjacency change’

syslog messages reported by the same source as a routing problem for that source.

2. A rule pattern was used to make an IS-IS adjacency lost event “caused by” an IS-IS

routing problem when the neighbor of the adjacency lost event is equal to the source of

the routing problem event.

3. A rule pattern was used to make a BGP adjacency down event “caused by” an IS-IS

routing problem when the neighbor of the adjacency down event is equal to the source of the routing problem event.

4. A rule pattern was used to make a BGP backward transition trap event “caused by” an

IS-IS routing problem when the neighbor of the backward transition event is equal to

the source of the routing problem event.
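
The four rules above can be sketched in code. The following Python is a minimal illustration of the correlation logic only; it is not SPECTRUM rule syntax or a SPECTRUM API. The event type names, the Event and Alarm fields, the threshold of three adjacency-change messages, and the 60-second window are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical event type names; SPECTRUM uses its own event codes.
ADJ_CHANGE = "isis_adjacency_change"
ADJ_LOST = "isis_adjacency_lost"
BGP_DOWN = "bgp_adjacency_down"
BGP_BACKWARD = "bgp_backward_transition"
SYMPTOM_TYPES = {ADJ_LOST, BGP_DOWN, BGP_BACKWARD}


@dataclass
class Event:
    type: str
    source: str                      # device that reported the event
    time: float                      # seconds
    neighbor: Optional[str] = None   # peer named in the event, if any


@dataclass
class Alarm:
    source: str
    cause: str
    symptoms: list = field(default_factory=list)


def routing_problem_sources(events, threshold=3, window=60.0):
    """Rule 1 (event rate counter): a source that reports `threshold` or more
    adjacency-change messages within `window` seconds has a routing problem.
    Both values are assumed, not taken from the product."""
    times_by_source = {}
    for ev in events:
        if ev.type == ADJ_CHANGE:
            times_by_source.setdefault(ev.source, []).append(ev.time)
    problems = set()
    for source, times in times_by_source.items():
        times.sort()
        for i in range(len(times) - threshold + 1):
            if times[i + threshold - 1] - times[i] <= window:
                problems.add(source)
                break
    return problems


def correlate(events, domain):
    """Rules 2-4: inside the correlation domain, a symptom event is 'caused by'
    a routing problem when its neighbor equals the routing problem's source.
    Events from sources outside the domain are ignored by this sketch."""
    in_domain = [ev for ev in events if ev.source in domain]
    alarms = {src: Alarm(src, "IS-IS routing problem")
              for src in routing_problem_sources(in_domain)}
    uncorrelated = []
    for ev in in_domain:
        if ev.type in SYMPTOM_TYPES:
            if ev.neighbor in alarms:
                alarms[ev.neighbor].symptoms.append(ev)
            else:
                uncorrelated.append(ev)
    return list(alarms.values()), uncorrelated
```

Calling correlate() with the events from the core routers and the set of router names in the Correlation Domain returns one alarm per routing-problem source, with the adjacency and BGP events attached as symptoms, plus any symptom events that matched no routing problem.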

AN HSRP/ VRRP ROUTING FAILURE EXAMPLE

Condition Correlation can also provide interesting and useful correlation of events when a

link is lost to a router in a Hot Standby Routing Protocol (HSRP) or Virtual Router

Redundancy Protocol (VRRP) environment. In the following example, a site has two

redundant routers that provide access via HSRP. For this case, the primary router

experiences a failure, but the redundant router is still servicing the customer’s site. You

might want an alarm that reports the failover to the redundant router and distinguishes it

from a total site outage. Knowledge from IMT, EMS and CCT can help to provide the RCA.




The following table outlines the syslog error messages and trap sequences for the HSRP

failover.

The seemingly disparate conditions that SPECTRUM correlated to create this alarm appear

in the “symptoms” panel of the alarm console. The correlation relies on the following:

1. A correlation domain consisting only of the two CPE HSRP routers and the PE router

interfaces that connect to these sites.

2. A rule pattern correlating the coincidence of an HSRPGrpStandByState event with a

state of active and a Device Contact Lost event to infer a Primary Connection Lost

condition.

3. A rule pattern that defines a Bad Link event caused by a Primary Connection Lost

event.

SPECTRUM applies these rule patterns to the HSRP correlation domains to prevent any

correlations outside of that scope.

Without these rules, SPECTRUM would have raised a critical alarm on the lost CPE device,

and on the connected port model. With these rules, it raises a major (Orange) alarm on the

CPE device indicating that the primary connection to the customer is lost. The other

conditions will appear in the symptoms table of this alarm.
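
The HSRP correlation can be sketched the same way. This Python fragment is a minimal illustration under stated assumptions, not SPECTRUM rule syntax: the event type strings ("hsrp_state", "contact_lost", "bad_link"), the device names, and the 30-second coincidence window are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Event:
    type: str                    # hypothetical: "hsrp_state", "contact_lost", "bad_link"
    source: str                  # device or interface that the event is about
    time: float                  # seconds
    state: Optional[str] = None  # HSRP state carried by an "hsrp_state" event


def correlate_hsrp(events, domain, window=30.0):
    """Return an alarm for the correlation domain, or None if the rules do not match.

    Rule pattern 1: an HSRP state event with state 'active' on one router that
    coincides (within `window` seconds, an assumed value) with Device Contact
    Lost on another router implies Primary Connection Lost.
    Rule pattern 2: Bad Link events in the domain are caused by that condition.
    """
    # Restrict every rule to the correlation domain.
    scoped = [ev for ev in events if ev.source in domain]
    became_active = [ev for ev in scoped
                     if ev.type == "hsrp_state" and ev.state == "active"]
    contact_lost = [ev for ev in scoped if ev.type == "contact_lost"]

    for lost in contact_lost:
        for active in became_active:
            coincident = abs(active.time - lost.time) <= window
            if active.source != lost.source and coincident:
                bad_links = [ev for ev in scoped if ev.type == "bad_link"]
                return {
                    "device": lost.source,
                    "severity": "major",
                    "cause": "Primary Connection Lost",
                    "symptoms": [active, lost] + bad_links,
                }
    # No failover observed: leave the events to the default alarm handling.
    return None
```

When the standby router never reports the active state, the function returns None, which corresponds to the case where the default critical alarms would stand, such as a total site outage.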




Apply Condition Correlation to Service Correlation

Typically, networks carry and support more than one service. As an example, in the cable

industry, telephone service (VoIP), internet access (High Speed Data), VoD and digital

cable are delivered over the same physical data network. Managing this network can be

quite a challenge. Inside the network (cable plant), the video transport equipment, video

subscription services, and the Cable Modem Termination System (CMTS) all work together to

put data on the cable network at the correct frequencies. Uncounted miles of cable along

with thousands of amplifiers and power supplies must carry the signals to the homes of

millions of subscribers.

If the network lines are cut in one area, as shown in the following diagram, the return path

monitoring system and the head end controller would report return path and power

problems in that area. The CMTS would provide the number of cable modems off-line for

the node. The video transport system would generate tune errors for video subscriptions in

that area. Lastly, the management system will lose contact with any business customer

modems that it is managing. With the flood of events and error messages from the

managed elements, it will be very obvious that problems exist with the service; the

challenge is to translate all that data into actionable root cause and service impact

information.

SPECTRUM can interpret the resulting deluge of events by using the service area of the

seemingly disparate events as a factor in the Condition Correlation. If the service areas and

services are modeled in SPECTRUM, it can use Condition Correlation to determine which

services in which areas are affected and the root cause or causes.




Service impact relevance goes beyond understanding what is impacted; it is also important

to identify what is not impacted. It is possible for the video subscription service to fail to

deliver VoD content to a single service area, and yet all other services to that area could be

operating normally. In another case, a return path problem in one area could cause

Internet, VoIP, and VoD services to fail and digital cable to degrade, yet analog cable would

still function normally. With the SPECTRUM capabilities and views of your infrastructure,

you can more quickly and easily detect the root cause and focus on addressing that

problem first.
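
To make the impacted versus not impacted distinction concrete, the following Python sketch derives a per-area service status from a stream of conditions. It is an illustration only and not how SPECTRUM models services: the condition names, the area labels, and the three status values are assumptions, and the impact table encodes only the two cases described above.

```python
SERVICES = ["Internet", "VoIP", "VoD", "digital cable", "analog cable"]

# Effect of each condition on each service, taken from the two cases in the
# text. The condition names themselves are hypothetical.
IMPACT = {
    "return_path_problem": {
        "Internet": "failed",
        "VoIP": "failed",
        "VoD": "failed",
        "digital cable": "degraded",
    },
    "vod_delivery_failure": {"VoD": "failed"},
}

# Ordering used to keep the worst status seen for a service.
RANK = {"normal": 0, "degraded": 1, "failed": 2}


def service_status(events):
    """events: iterable of (service_area, condition) pairs.

    Returns {area: {service: status}} so that both the impacted and the
    unaffected services in each area are visible.
    """
    status = {}
    for area, condition in events:
        area_status = status.setdefault(area, {s: "normal" for s in SERVICES})
        for service, effect in IMPACT.get(condition, {}).items():
            if RANK[effect] > RANK[area_status[service]]:
                area_status[service] = effect
    return status
```

For an area that reports a return path problem, the result marks Internet, VoIP, and VoD as failed and digital cable as degraded, while analog cable stays normal, which tells the operator what does not need attention.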

Leverage the Integrated Solution

After SPECTRUM has identified the root cause problems, operations staff can quickly obtain

details about the problem history and troubleshooting information by drilling down from the

alarms in the OneClick browser to the eHealth and eHealth for Voice reports and tools. For

example, operators could right-click an alarm in OneClick and do any of the following:

Drill down to Trend reports for the problem variable.

Drill down to At-a-Glance reports for a snapshot of several key performance variables for

that resource.

Drill down to the eHealth for Voice console or the eHealth web reporting interface to view

more reports and details about problem voice PBX systems and call message servers.

Launch a browser to the eHealth web reporting interface to view more reports and details

about problem resources.




Exceptions section, sending traps, 136
Exceptions Summary Report, 145

F
fault scenarios, 103
Frame Relay Manager, 41

G
Global Collections, 58
governments, 21
Grade of service (GoS), 155
group lists, purpose of, 60
groups
  adding to group list, 61
  creating, 60
  purpose of, 60

H
Health reports, 65
  forwarding traps, 65
  systems, 82
Healthcheck services, 29
heterogeneous networks, 10

I
Inductive Modeling Technology (IMT), 162
installation
  prerequisites, 53
  steps, 54
InstallPlus kit, 55
integrated solution, configuring, 57
integration
  eHealth SPECTRUM, 25
  modules (IMs), 36
  value, 11
IT resources, 18

L
lifecycle of best practices, 26
Live Exceptions
  profiles, 63
  service alarm situations, 136
  starting, 63
Live Health, 36
  forwarding traps to SPECTRUM, 64
  profiles, 62
Live Trend, 82

M
Model by IP Address settings, 78
Modeling Gateway, 168
models, SPECTRUM, 162
Multicast Manager, 42
MyHealth reports, 82

N
neighbors, 169
Network and Voice Management solution, eHealth, 10
network and voice management strategy, 19
network evolution, 13
network fault and performance issues, 10
Network Fault Management, components, 38
network management solutions, 10
Network Performance Management, components, 35
network support, 10
node licenses, 48

O
OneClick
  clearing alarms, 137
  SPECTRUM, 39
OneClick for eHealth console, 60
Operational Support System (OSS), 20
overutilized resources
  confirming, 145
  documenting history of, 147
  locating, 145
  modeling changes for, 148
  resolving, 147

P
predictive capacity planning, 33, 139
proactive service assurance, 31, 135
process rules, 102

Q
QoS Manager, 42

R
rapid problem resolution, 32, 159
Remote Poller, 37
Report Center, 37
Report Manager
  accessing, 118
  SPECTRUM, 45
reports
  At-a-Glance, 80
  Health, 65, 82
  MyHealth, 82
  Top N, 86
  Trend, 84
  What-If Capacity Trend, 86
resource monitoring, 92
response time tests, 105
return on investment (ROI), 143
RFC 2790
  extensions, 72
  support for process modeling, 102
RMON2 probes, 37
role-based service information, 10
root cause, 92

S
scale, 10
Secure Domain Manager (SDM), 44
service assurance, 135
Service Availability, 17
  reports, 118
service dashboard, 46
service delivery platform, 13
Service Desk tickets, 137
Service Editor, 103
Service health, 92
Service Health Matrix table, 94
service hierarchies, 93
service level management, 30
service management
  approach, 87
  interview process, 88
  mapping procedures, 89
Service Management module, 87
Service Manager, 46
service modeling, 92
  creating, 93
  fault scenarios, 103
  health table, 94
  policy design, 95
  process rule, 102
  response time, 105
Service Performance Manager (SPM), 45
service providers, 20
Situations to Watch chart, 149
Situations to Watch Detail report, 150
Sizing Wizard, 50
SLA
  business hours, 109
  components, 115
  guarantees, 108
  implementing, 116
  monitoring, 108
  periods, 108
  reports, 118
SLA modeling concepts, 108
SNMPv3 support, 43
software and hardware requirements
  eHealth, 50
  OneClick, 51
  SPECTRUM, 50
  Voice, 52
Solution Architecture Overview (SAO), 27
Solution Architecture Specification (SAS), 28
SpectroSERVER, 54
SPECTRUM
  backing up, 70
  benefits, 25
  components, 24
  configuring eHealth, 68
  discovery, 57
  eHealth integration, 25
  OneClick, 39
  Service Manager, 30
  viewing alarms, 69
  Watch Editor, 40
SPECTRUM Alarm Notification Manager (SANM), 40
SPECTRUM Integrity, 39
SPECTRUM Report Manager, 45
syslog messages, 171
system agents, monitoring best practices, 71
system monitoring, 71
system requirements, 54
SystemEDGE Agents, 71

T
telecommunication service provider, 20
third-party agents, 71
Time over Threshold alarms, 31
Time over Threshold rules, 136
Top N reports, 86
topology, 77, 167
Traffic Accountant, 37
traps
  forwarding from Health reports, 65
  forwarding from Voice, 66
  forwarding to SPECTRUM from eHealth, 64
Trend reports, 84

U
underused resources
  confirming, 141
  finding, 140
  resolving, 143
  return on investment, 143
Underutilized Elements report, 140
  best practices, 141
Unicenter NSM Agents, 71
  discovering in SPECTRUM, 79

V
voice message disk capacity, 158
voice services, network, 25
Voice, capacity planning, 154
VPN Manager, 43

W
Watch Editor, 40
watches, SPECTRUM, 40
What-If reports, 86
  running, 152
