Cisco Mttr e Mtbf
Transcript of Cisco Mttr e Mtbf
© 2004 Cisco Systems, Inc. All rights reserved. Printed in USA.Presentation_ID.scr
1© 2004 Cisco Systems, Inc. All rights reserved.NMS-22019627_05_2004_c2
AVAILABILITY MEASUREMENT
SESSION NMS-2201
Agenda
• Introduction
• Availability Measurement Methodologies
    Trouble Ticketing
    Device Reachability: ICMP (Ping), SA Agent, COOL
    SNMP: Uptime, Ping-MIB, COOL, EEM, SA Agent Application
• Developing an Availability ‘Culture’
Associated Sessions
• NMS-1N01: Intro to Network Management
• NMS-1N02: Intro to SNMP and MIBs
• NMS-1N04: Intro to Service Assurance Agent
• NMS-1N41: Introduction to Performance Management
• NMS-2042: Performance Measurement with Cisco IOS®
• ACC-2010: Deploying Mobility in HA Wireless LANs
• NMS-2202: How Cisco Achieved HA in Its LAN
• RST-2514: HA in Campus Network Deployments
• NMS-4043: Advanced Service Assurance Agent
• RST-4312: High Availability in Routing
INTRODUCTION: WHY MEASURE AVAILABILITY?
Why Measure Availability?
1. Baseline the network
2. Identify areas for network improvement
3. Measure the impact of improvement projects
Why Should We Care About Network Availability?
• Where are we now? (baseline)
• Where are we going? (business objectives)
• How best do we get from where we are now to where we are going? (improvements)
• "What if we can't get there from here?"
Why Should We Care About Network Availability?
Downtime: Costs Too Much!!!
Recent studies by Sage Research determined that US-based service providers encountered:
• Percent of downtime that is unscheduled: 44%
• 18% of customers experience over 100 hours of unscheduled downtime or an availability of 98.5%
• Average cost of network downtime per year: $21.6 million or $2,169 per minute!
SOURCE: Sage Research, IP Service Provider Downtime Study: Analysis of Downtime Causes, Costs and Containment Strategies, August 17, 2001, Prepared for Cisco SPLOB
Cause of Network Outages
• User Error and Process: 40%
    Change management
    Process consistency
• Technology: 20%
    Hardware
    Links
    Design
    Environmental issues
    Natural disasters
• Software and Application: 40%
    Software issues
    Performance and load
    Scaling
Source: Gartner Group
Top Three Causes of Network Outages
• Congestive degradation
• Capacity (unanticipated peaks)
• Solutions validation
• Software quality
• Inadvertent configuration change
• Change management
• Network design
• WAN failure (e.g., major fiber cut or carrier failure)
• Power
• Critical services failure (e.g., DNS/DHCP)
• Protocol implementations and misbehavior
• Hardware fault
Method for Attaining a Highly-Available Network
Or a Road to Five Nines
• Establish a standard measurement method
• Define business goals as related to metrics
• Categorize failures, root causes, and improvements
• Take action for root cause resolution and improvement implementation
Where Are We Going? Or What Are Your Business Goals?
Define Your 'End-State': What Is Your Goal?
• Financial
    ROI
    Economic Value Added
    Revenue/Employee
• Productivity
• Time to market
• Organizational mission
• Customer perspective
    Satisfaction
    Retention
    Market Share
Why Availability for Business Requirements?
• Availability as a basis for productivity data
    Measurement of total-factor productivity
    Benchmarking the organization
    Overall organizational performance metric
• Availability as a basis for organizational competency
    Availability as a core competency
    Availability improvement as an innovation metric
• Resource allocation information
    Identify defects
    Identify root cause
    Measure MTTR (tied to process)
It Takes a Design Effort to Achieve HA
Hardware and Software Design
Network and Physical Plant Design
Process Design
INTRODUCTION: WHAT IS NETWORK AVAILABILITY?
What Is High Availability?
Availability    Downtime per Year (24x7x365)
99.000%         3 Days 15 Hours 36 Minutes
99.500%         1 Day 19 Hours 48 Minutes
99.900%         8 Hours 46 Minutes
99.950%         4 Hours 23 Minutes
99.990%         53 Minutes
99.999%         5 Minutes
99.9999%        30 Seconds
High Availability Means an Average End User Will Experience Less than Five Minutes Downtime per Year
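The downtime figures on this slide follow directly from the availability percentage. A minimal Python sketch of that conversion, assuming a 24 x 365 hour year as the slide does (the function name `downtime_per_year` is illustrative, not from the presentation):

```python
def downtime_per_year(availability_pct, hours_per_year=24 * 365):
    """Convert an availability percentage into yearly downtime.

    Returns a (days, hours, minutes, seconds) tuple, rounded to the
    nearest second.
    """
    down_seconds = round((1 - availability_pct / 100) * hours_per_year * 3600)
    days, rem = divmod(down_seconds, 86400)
    hours, rem = divmod(rem, 3600)
    minutes, seconds = divmod(rem, 60)
    return days, hours, minutes, seconds


for pct in (99.0, 99.5, 99.9, 99.95, 99.99, 99.999, 99.9999):
    d, h, m, s = downtime_per_year(pct)
    print(f"{pct}%: {d} d {h} h {m} min {s} s")
```

The slide truncates each value to its two or three largest units, so the sketch's seconds column carries slightly more detail than the table.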
Availability Definition
• Availability definition is based on business objectives
Is it the user experience you are interested in measuring?
Are some users more important than others?
• Availability groups? Definitions of different groups
• Exceptions to the availability definition
e.g., the CEO should never experience a 'network' problem
How You Define Availability
• Define availability perspective (customer, business, etc.)
• Define availability groups and levels of redundancy
• Define an outage
• Define impact to network
    Ensure SLAs are compatible with outage definition
    Understand how maintenance windows affect outage definition
    Identify how to handle DNS and DHCP within definition of Layer 3 outage
    Examine component level sparing strategy
• Define what to measure
• Define measurement accuracy requirements
Network Design: What Is Reliability?
• "Reliability" is often used as a general term that refers to the quality of a product
    Failure rate
    MTBF (Mean Time Between Failures) or MTTF (Mean Time To Failure)
    Engineered availability
• Reliability is defined as the probability of survival (or no failure) for a stated length of time
MTBF Defined
• MTBF stands for Mean Time Between Failures
• MTTF stands for Mean Time to Failure
    This is the average length of time between failures (MTBF) or to a failure (MTTF)
    More technically, it is the mean time to go from an OPERATIONAL STATE to a NON-OPERATIONAL STATE
    MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems
• MTTR stands for Mean Time to Repair
One Method of Calculating Availability
• Availability = MTBF / (MTBF + MTTR)
• What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs?
    A = 10000 / (10000 + 12) = 99.88%
• Annual uptime
    8,760 hrs/year x 0.9988 = 8,749.5 hrs
• Conversely, annual downtime is
    8,760 hrs/year x (1 - 0.9988) = 10.5 hrs
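The worked example above can be checked with a few lines of Python. This is only a sketch of the slide's formula; the function and constant names are illustrative.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days, as used on the slide


def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


a = availability(10_000, 12)
annual_uptime = HOURS_PER_YEAR * a
annual_downtime = HOURS_PER_YEAR * (1 - a)

print(f"Availability:    {a:.2%}")
print(f"Annual uptime:   {annual_uptime:.1f} hrs")
print(f"Annual downtime: {annual_downtime:.1f} hrs")
```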
Networks Consist of Series-Parallel
• Combinations of in-series and redundant components
[Figure: reliability block diagram (RBD) with components A, B1, B2, C, D1, D2, D3, E, and F; redundancy labels 1/2 and 2/3]
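A series-parallel block diagram like this one can be evaluated with three small helpers. The sketch below assumes independent failures, reads the "1/2" and "2/3" labels as 1-of-2 and 2-of-3 redundancy, and uses a made-up availability of 0.999 for every component; none of these assumptions or numbers come from the slide.

```python
from math import comb, prod


def series(*avail):
    """All components must be up: multiply their availabilities."""
    return prod(avail)


def parallel(*avail):
    """At least one component must be up (independent failures assumed)."""
    return 1 - prod(1 - a for a in avail)


def k_of_n(k, n, a):
    """At least k of n identical components, each with availability a, are up."""
    return sum(comb(n, i) * a**i * (1 - a) ** (n - i) for i in range(k, n + 1))


# Hypothetical per-component availability, for illustration only.
A = 0.999
system = series(
    A,                 # A
    parallel(A, A),    # B1, B2 (1 of 2)
    A,                 # C
    k_of_n(2, 3, A),   # D1, D2, D3 (2 of 3)
    A,                 # E
    A,                 # F
)
print(f"System availability: {system:.6f}")
```

In a chain like this the single, non-redundant components dominate the result, which is why redundancy is usually added where a lone component would otherwise cap the system's availability.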
More Complex Redundancy
• Pure active parallel: all components are on
• Standby redundant: backup components are not operating
• Perfect switching: switch-over is immediate and without fail
• Switch-over reliability: the probability of switch-over when it is not perfect
• Load sharing: all units are on and workload is distributed
MEASURING THE PRODUCTION NETWORK
Reliability or Engineered Availability vs. Measured Availability
1. Reliability is an engineered probability of the network being available
2. Measured Availability is the actual outcome produced by physically measuring over time the engineered system
Calculations Are Similar—Both Are Based on MTBF and MTTR
Availability Choice Based on Business Goals
• Passive availability measurement (without sending additional traffic on the production network; uses data from problem management, fault management, or another system)
• Active availability measurement (traffic is sent specifically for availability measurement, using ICMP echo, SNMP, SA Agent, etc., to generate data)
Types of Availability
• Device/interface
• Path
• Users
• Application
Some Types of Availability Metrics
• Mean Time to Repair (MTTR)
• Impacted User Minutes (IUM)
• Defects per Million (DPM)
• Mean Time Between Failures (MTBF)
• Performance (e.g., latency, drops)
Back to How Availability Is Calculated?
• Availability (%) is calculated by tabulating end user outage time, typically on a monthly basis
• Some customers prefer to use DPM (Defects per Million) to represent network availability
Availability (%) = (Total User Time - Total User Outage Time) / Total User Time x 10^2
DPM = Total User Outage Time / Total User Time x 10^6
Total User Time = Total # of End Users x Time in Reporting Period
Total User Outage Time = Σ (# of End Users x Outage Time in Reporting Period)
Σ is over all the incidents in the reporting period
Ports or connections may be substituted for "end users"
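As an illustration of the user-time formulas, the Python sketch below tabulates availability (%) and DPM from a list of incidents. The incident data, user count, and function name are invented for the example.

```python
def user_time_metrics(total_users, period_hours, incidents):
    """Compute availability (%) and DPM from per-incident outage data.

    incidents: iterable of (users_affected, outage_hours) pairs, one pair
    per incident in the reporting period.
    """
    total_user_time = total_users * period_hours
    total_user_outage_time = sum(users * hours for users, hours in incidents)
    availability_pct = (total_user_time - total_user_outage_time) / total_user_time * 100
    dpm = total_user_outage_time / total_user_time * 1e6
    return availability_pct, dpm


# Hypothetical month: 1000 users, 30 days, two incidents.
incidents = [(50, 2.0), (200, 0.5)]
availability_pct, dpm = user_time_metrics(1000, 30 * 24, incidents)
print(f"Availability: {availability_pct:.4f}%  DPM: {dpm:.1f}")
```

Ports or connections can be passed in place of users without changing the arithmetic.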
Defects per Million
• Started with mass produced items like toasters
• For PVCs:
    DPM = Σ (#conns x outage minutes) / Σ (#conns x total minutes) x 10^6
• For SVCs or phone calls:
    DPM = Σ (#existing calls lost + #new calls blocked) / total calls attempted x 10^6
• For connectionless traffic (application dependent):
    DPM = Σ (#end users x outage minutes) / Σ (#end users x total minutes) x 10^6
NETWORK AVAILABILITY COLLECTION METHODS
TROUBLE TICKETING METHODS
Availability Improvement Process
• Step I
    Validate data collection/calculation methodology
    Establish network availability baseline
    Set high availability goals
• Step II
    Measure uptime ongoing
    Track defects per million (DPM) or IUM or availability (%)
• Step III
    Track customer impact for each ticket/MTTR
    Categorize DPM by reason code and begin trending
    Identify initiatives/areas of focus to eliminate defects
Data Collection/Analysis Process
• Understand current data collection methodology
    Customer internal ticket database
    Manual
• Collect network performance data monthly and export the following fields to a spreadsheet or database system:
    Outage start time (date/time)
    Service restore time (date/time)
    Problem description
    Root cause
    Resolution
    Number of customers impacted
    Equipment model
    Component/part
    Planned maintenance activity/unplanned activity
    Total customers/ports on network
Network Availability Results
• Methodology and assumptions must be documented
• Network availability should include:
    Overall % network availability (baseline/trending)
    Conversion of downtime to DPM by:
        Planned and unplanned
        Root cause
        Resolution
        Equipment type
    Overall MTTR
    MTTR by:
        Root cause
        Resolution
        Equipment type
• Results are not necessarily limited to the above but should be customized based on your network and requirements
Availability Metrics: Reviewed
• Network has 100 customers
• Time in reporting period is one year or 24 hours x 365 days
• 8 customers have 24 hours down time per year
Availability = 1 - (8 x 24) / (100 x 24 x 365) = 0.999781
DPM = (8 x 24 x 10^6) / (100 x 24 x 365) = 219.2 failures for every 1 million user hours
MTBF = (24 x 365) / 8 = 1095 (hours)
MTTR = 1095 x (1 - 0.999781) / 0.999781 = 0.24 (hours)
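Working the slide's inputs through in Python ties the four metrics together. This is a sketch; the variable names are illustrative, and the MTTR line simply rearranges Availability = MTBF / (MTBF + MTTR).

```python
customers = 100
period_hours = 24 * 365     # one-year reporting period
failures = 8                # 8 customers each had an outage
outage_hours_each = 24

outage_user_hours = failures * outage_hours_each
unavailability = outage_user_hours / (customers * period_hours)

availability = 1 - unavailability
dpm = unavailability * 1e6
mtbf = period_hours / failures
# Rearranged from Availability = MTBF / (MTBF + MTTR)
mttr = mtbf * (1 - availability) / availability

print(f"Availability = {availability:.6f}")
print(f"DPM          = {dpm:.1f}")
print(f"MTBF         = {mtbf:.0f} hours")
print(f"MTTR         = {mttr:.2f} hours")
```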
TROUBLE TICKETING METHOD: SAMPLE OUTPUT
Network Availability
[Chart (illustrative): Overall Network Availability (Planned/Unplanned) by month, July through June; y-axis from 99.50% to 100.00%]
• Key takeaways
Platform Related DPM Comparison (illustrative)
• Platform related DPM contributed 13% of total DPM in September
• Platform DPM includes events from:
    Backbone
    NAS
    PG
    POP
    Radius Server
    VPN Radius Server
• All other events are included in the "Other" category

DPM                 June    July    Aug     Sept    Oct   Nov   Dec
Other               339.5   424.9   394.7   362.2   --    --    --
Platform Related    49.2    82.5    104     52.6    --    --    --
Total DPM           388.7   507.4   498.7   414.8   --    --    --
99.99% Target       100     100     100     100     100   100   100

Breakdown of Platform Related DPM
• Network Access Server (NAS) accounts for 50% of the total platform related DPM in September
• Private Access Gateway (PG) shows a significant decrease over the past 3 months

                         June    July    Aug     Sept
Backbone                 1.5     .8      15.7    2.3
NAS                      21.7    19.4    27      26.1
PG                       26      59.6    56.8    18.9
POP                      0       3.9     .5      1.6
Radius Server            0       0       1.2     .3
VPN Radius               0       8.8     2.8     3.4
Total Platform Related   49.2    82.5    104     52.6
DPM by Cause
[Chart and table (illustrative): DPM by cause per month, Dec through May; categories Config/SW, HW, Power, Environmental, Human Error, Other, and Unknown, with a TOTAL row; y-axis from 0 to 2500 DPM]
MTTR Analysis: Hardware Faults
Produce for Each Fault Type (example shown: Router HW)
• Number of faults increased slightly in September; however, MTTR decreased
• 49% of faults resolved in < 1 hour in September
• 11% of faults resolved in > 24 hours, with an additional 3% > 100 hours
[Chart (illustrative): Router HW MTTR in hours by month, Jun through Dec; data labels 12.42, 15.1, 8.49, 7.19]
[Chart (illustrative): # of faults by month, Jun through Dec, stacked by time to resolve: <1 Hr, 1-4 Hr, 4-12 Hr, 12-24 Hr, >24 Hr, >100]
[Chart (illustrative): share of total faults by month, Jun through Dec, with the same time-to-resolve bands]
Unplanned DPM (illustrative)
• Key takeaways
• Action plans
    Identify areas of focus to enable reduction of DPM to achieve network availability goal

          Feb   Mar   Apr   May   Jun   Jul   Aug   Sep   Oct   Nov   Dec
SW        60    140   50    67    80    65    200   145   100   40    10
HW        90    200   80    104   180   115   385   325   245   110   5
Process   90    80    55    100   100   90    210   180   75    10    5
Other     70    100   35    79    80    80    165   110   40    10    0
TOTAL     310   520   220   350   440   350   960   760   460   170   40
Trouble Ticketing Method
• Pros
    Easy to get started
    No network overhead
    Outages can be categorized based on event
• Cons
    Some internal subjective/consistency process issues
    Outages may occur that are not included in the trouble ticketing systems
    Resources needed to scrub data and create reports
    May not work with existing trouble ticketing system/process
Network Availability Collection Methods
AUTOMATED FAULT MANAGEMENT EVENTS METHOD
Availability Improvement Process
• Step I
    Determine availability goals
    Validate fault management data collection
    Determine a calculation methodology
    Build software package to use customer event log
• Step II
    Establish network availability baseline
    Measure uptime on an ongoing basis
• Step III
    Track root cause and customer impact
    Begin trending of availability issues
    Identify initiatives and areas of focus to eliminate defects
Event Log
• Analysis of events received from the network devices
• Analysis of accuracy of the data
Event Log Example
Fri Jun 15 11:05:31 2001 Debug: Looking for message header ...
Fri Jun 15 11:05:33 2001 Debug: Message header is okay
Fri Jun 15 11:05:33 2001 Debug: $(LDT) -> "06152001110532"
Fri Jun 15 11:05:33 2001 Debug: $(MesgID) -> "100013"
Fri Jun 15 11:05:33 2001 Debug: $(NodeName) -> "ixc00asm"
Fri Jun 15 11:05:33 2001 Debug: $(IPAddr) -> "10.25.0.235"
Fri Jun 15 11:05:33 2001 Debug: $(ROCom) -> "xlr8ed!"
Fri Jun 15 11:05:33 2001 Debug: $(RWCom) -> "s39o!d%"
Fri Jun 15 11:05:33 2001 Debug: $(NPG) -> "CISCO-Large-special"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmDN) -> "aSnmpStatus"
Fri Jun 15 11:05:33 2001 Debug: $(AlrmProp) -> "system"
Fri Jun 15 11:05:33 2001 Debug: $(OSN) -> "Testing"
Fri Jun 15 11:05:33 2001 Debug: $(OSS) -> "Normal"
Fri Jun 15 11:05:33 2001 Debug: $(DSN) -> "SNMP_Down"
Fri Jun 15 11:05:33 2001 Debug: $(DSS) -> "Agent_Down"
Fri Jun 15 11:05:33 2001 Debug: $(TrigName) -> "NodeStateUp"
Fri Jun 15 11:05:33 2001 Debug: $(BON) -> "nl-ping"
Fri Jun 15 11:05:33 2001 Debug: $(TrapGN) -> "-2"
Fri Jun 15 11:05:33 2001 Debug: $(TrapSN) -> "-2"
Calculation Methodology: Example
• Primary events are device down/up
• Down time is calculated based on device-type outage duration
• Availability is calculated based on the total number of device types, the total time, and the total down time
• MTTR numbers are calculated from the average duration of downtime
• With MTTR, the shortest and longest outages provide a simplified curve
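The methodology above can be sketched in Python for a single device: pair each down event with the next up event, then derive total down time, availability, MTTR, and the shortest and longest outages. The event tuples, field names, and sample timestamps are hypothetical; a real fault management log would need its own parser, and the slide aggregates per device type rather than per device.

```python
from datetime import datetime, timedelta


def outage_stats(events, period):
    """Summarise outages for one device.

    events: time-ordered (timestamp, state) pairs, state is "down" or "up".
    period: timedelta covering the whole reporting period.
    """
    outages = []
    down_since = None
    for timestamp, state in events:
        if state == "down" and down_since is None:
            down_since = timestamp          # outage starts
        elif state == "up" and down_since is not None:
            outages.append(timestamp - down_since)  # outage ends
            down_since = None
    total_down = sum(outages, timedelta())
    return {
        "incidents": len(outages),
        "total_down": total_down,
        "availability": 1 - total_down / period,
        "mttr": total_down / len(outages) if outages else timedelta(),
        "shortest": min(outages, default=timedelta()),
        "longest": max(outages, default=timedelta()),
    }


# Hypothetical events for one device over one day.
events = [
    (datetime(2001, 6, 15, 11, 0), "down"),
    (datetime(2001, 6, 15, 11, 10), "up"),
    (datetime(2001, 6, 15, 18, 0), "down"),
    (datetime(2001, 6, 15, 18, 20), "up"),
]
stats = outage_stats(events, period=timedelta(days=1))
print(stats)
```

Repeated down events without an intervening up event are ignored here; whether that matches a given NMS's event semantics is an assumption to verify against the actual log.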
Automated Fault Management Methodology
• Pros
    Outage duration and scope can be fairly accurate
    Can be implemented within an NMS fault management system
    No additional network overhead
• Cons
    Requires an excellent change management/provisioning process
    Requires an efficient and effective fault management system
    Requires custom development
    Does not account for routing problems
    Not a "true" end-to-end measure
NETWORK AVAILABILITY DATA COLLECTION: SAMPLE OUTPUT
Automated Fault Management: Example Reports

Device Type      # of Devices   Count of Incidents   Total Down Time (hhh:mm:ss)   %Down    %Up        Shortest Outage Duration   Mean Time to Repair   Longest Outage Duration   Events per Device
Host Totals      2389           801                  202:27:27                     .0673%   99.9327%   0:00:19                    0:20:47               27:48:46                  24.4
Network Totals   4732           1673                 430:02:03                     .1309%   99.8691%   0:00:24                    0:22:36               09:49:35                  14.9
Other Totals     897            173                  212:29:46                     .0509%   99.9491%   0:00:17                    0:26:07               42:16:10                  16.8
GRAND TOTAL      8018           2647                 844:59:16                     .0830%   99.9170%   0:00:20                    0:23:10               26:38:11                  18.7

Automated Fault Management: Example Reports (2)
[Pie chart: Number of Managed Devices. Host Totals 30%, Network Totals 59%, Other Totals 11%]
[Pie chart: Count of Incidents. Host Totals 30%, Network Totals 63%, Other Totals 7%]
[Pie chart: Total Down Time. Host Totals 24%, Network Totals 51%, Other Totals 25%]
Network Availability Collection Methods
ICMP ECHO (PING) AND SNMP AS DATA GATHERING TECHNIQUES
Data Gathering Techniques
• ICMP ping
• Link and device polling (SNMP)
• Embedded RMON
• Embedded event management
• Syslog messages
• COOL
Data Gathering Techniques: ICMP Reachability
• Method definition: Central workstation or computer configured to send ping packets to the network edges (devices or ports) to determine reachability
• How: Edge interfaces and/or devices are defined and "pinged" on a determined interval
• Unavailability: Pre-defined, non-response from the interface
Availability Measurement Through ICMP
[Figure: periodic ICMP test, with periodic pings to network devices and periodic pings to network leaf nodes]
Data Gathering Techniques: ICMP Reachability
• Pros
    Fairly accurate "network availability"
    Accounts for routing problems
    Can be implemented for fairly low network overhead
• Cons
    Point to multipoint implies not a "true" end-to-end measure
    Availability granularity limited by ping frequency
    Maintenance of device database: must have a solid change management and provisioning process
Data Gathering Techniques: Link and Device Status
• Method definition: SNMP polling and trapping on links, edge ports, or edge devices
• How: An agent is configured to SNMP poll and tabulate outage times for defined devices or links; a database maintains outage times and total service time; sometimes trap information is used to augment this method by providing more accurate information on outages
• Unavailability: Pre-defined, non-redundant links, ports, or devices that are down
Polling Interval vs. Sample Size
• Polling interval is the rate at which data is collected from the network
    Polling interval = 1 / Sampling rate
• The smaller the polling interval, the more detailed (granular) the data collected
    Example: polling data once every 15 minutes provides 4 times the detail (granularity) of polling once an hour
• A smaller polling interval does not necessarily provide a better margin of error
    Example: polling once every 15 minutes for one hour has the same margin of error as polling once an hour for 4 hours
Link and Device Status Method
• Method definition: SNMP polling and trapping on links, edge ports, or edge devices
• How: Utilizing existing NMS systems that are currently SNMP polling to tabulate outage times for defined devices or links
    A database maintains outage times and total service time
    SNMP trap information is also used to augment this method by providing more accurate information on outages
Link and Device Status Method
• Pros
    Outage duration and scope can be fairly accurate
    Utilizes existing NMS systems
    Low network overhead
• Cons
    No canned software to do this; requires custom development
    Maintaining the element device database is challenging
    Requires an excellent change management and provisioning process
    Does not account for routing problems
    Not a "true" end-to-end measure
CISCO SERVICE ASSURANCE AGENT (SA AGENT)
Service Assurance Agent
• Method definition: SA Agent is an embedded feature of Cisco IOS software and requires configuration of the feature on routers within the customer network; use of the SA Agent can provide for a rapid, cost-effective deployment without additional hardware probes
• How: A data collector creates SA Agents on the routers to monitor certain network/service performances; the data collector then collects this data from the routers, aggregates it and makes it available
• Unavailability: Pre-defined paths with reporting on non-redundant links, ports, or devices that are down within a path
Case Study: Financial Institution (Collection)
[Figure: SA Agent collectors, remote sites, DNS, and Internet web sites]
Availability Using Network-Based Probes
• DPM equations used with network-based probes as input data
• Probes can be
    Simple ICMP ping probe, modified ping to test specific applications, Cisco IOS SA Agent
• DPM will be for connectivity between 2 points on the network, the source and destination of the probe
    Source of the probe is usually a management system and the destinations are the devices managed
    Can calculate DPM for every device managed
Availability = 1 - (Probes with No Response / Total Probes Sent)
DPM = (Probes with No Response / Total Probes Sent) x 10^6
Availability Using Network-Based Probes: Example
• Network probe is a ping
• 10000 probes are sent between management system and managed device
• 1 probe failed to respond
Availability = 1 - (1 / 10000) = 0.9999
DPM = (1 / 10000) x 10^6 = 100 (100 probes out of 1 million will fail)
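The probe-based calculation is small enough to express directly. A minimal Python sketch, with an illustrative function name, reproducing the example's numbers:

```python
def probe_metrics(probes_sent, probes_no_response):
    """Return (availability, DPM) from active probe counts."""
    failure_ratio = probes_no_response / probes_sent
    return 1 - failure_ratio, failure_ratio * 1e6


availability, dpm = probe_metrics(probes_sent=10_000, probes_no_response=1)
print(f"Availability = {availability:.4f}, DPM = {dpm:.0f}")
```

The same function can be applied per managed device, since each management-system-to-device pair has its own probe counts.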
Sample Size
• Sample size is the number of samples that have been collected
• The more samples collected, the higher the confidence that the data accurately represents the network
• Confidence (margin of error) is defined by
    m = 1 / sqrt(sample size)
• Example: data is collected from the network every 1 hour
    After one day: m = 1 / sqrt(24) = 0.2041
    After one month: m = 1 / sqrt(24 x 31) = 0.0367
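A short Python sketch of the margin-of-error rule, including the earlier point that the margin depends on the number of samples rather than on the polling interval alone. The helper names are illustrative.

```python
from math import sqrt


def margin_of_error(sample_size):
    """m = 1 / sqrt(sample size)."""
    return 1 / sqrt(sample_size)


def sample_count(polling_interval_minutes, duration_hours):
    """Number of samples collected over a duration at a fixed polling interval."""
    return int(duration_hours * 60 / polling_interval_minutes)


print(f"Hourly polling, one day:   m = {margin_of_error(24):.4f}")
print(f"Hourly polling, one month: m = {margin_of_error(24 * 31):.4f}")

# 15-minute polling for 1 hour and hourly polling for 4 hours both give 4 samples.
fast = margin_of_error(sample_count(15, 1))
slow = margin_of_error(sample_count(60, 4))
print(f"Same margin of error: {fast == slow}")
```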
Service Assurance Agent
• Pros
    Accurate "network availability" for defined "paths"
    Accounts for routing problems
    Implementation with very low network overhead
• Cons
    Requires a system to collect the SAA data
    Requires implementation in the router configurations
    Availability granularity limited by polling frequency
    Requires definition of the critical network paths to be measured
COMPONENT OUTAGE ONLINE MEASUREMENT (COOL)
COOL Objectives
• To automate the measurement to increase operational efficiency and reduce operational cost
• To measure the outage as close to the source of outage events as possible to pinpoint the cause of the outages
• To cope with a large number of network elements without causing system and network performance degradation
• To maintain measurement data reliably in the presence of element failure or network partition
• To support simplicity in deployment, configuration, and data collection (autonomous measurement)
COOL Features
• COOL embedded in router
• Automated real-time measurement
• Autonomous measurement
• Outage data stored in router
• Open access via Outage Monitor MIB
• Event notification filtering
[Figure labels: NetTools, 3rd Party Tools, C-NOTE, PNL, NMS, Outage Monitor MIB, Access Router, Customer Equipment]
COOL Features (Cont.)
• Support NMS or tools for such applications as
    Calculation of software or hardware MTBF, MTTR, and availability per object, device, or network
    Verification of customer's SLA
    Troubleshooting in real time
• Two-tier framework
    Reduces performance impact on the router
    Provides scalability to the NMS
    Makes it easy to deploy
    Provides flexibility in availability calculation
[Figure: two-tier framework. Outage monitoring and measurement: COOL and the Outage Monitor MIB on access and core routers, with customer equipment attached. Outage correlation and calculation: NMS]
Outage Model
Type   Objects Monitored         Failure Modes
A      Physical Entity Objects   Component hardware or software failure, including the failure of line card, power supplies, fan, switch fabric, and so on
B      Interface Objects         Interface hardware or software failure, loss of signal
C      Remote Objects            Failure of remote device (customer equipment or peer networking device) or link in-between
D      Software Objects          Failure of software processes running on the RPs and line cards
[Figure: access router with RP, power, fan, etc. (A), physical and logical interfaces (B), and software processes (D); links to remote objects (C): network management system, customer equipment, MUX/hub/switch, and peer router]
Outage Characterization
• Data Definition
Defect threshold: a value across which the object is considered to be defective (service degradation or complete outage)
Duration threshold: the minimum period beyond which an outage needs to be reported (given SLA)
Start time: when the object outage starts
End time: when the outage ends
[Figure: outage timeline. A down event at the start time crosses the defect threshold and an up event at the end time clears it; the outage duration between them is compared with the duration threshold.]
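The data definitions above can be sketched in a few lines of Python. This is only an illustration: the `Outage` record and the `reportable` helper are hypothetical names, not part of COOL, and the sketch assumes that "beyond" the duration threshold means strictly longer than it.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One outage of a monitored object (hypothetical record layout)."""
    start: float  # start time: when the down event occurred
    end: float    # end time: when the up event occurred

    @property
    def duration(self) -> float:
        return self.end - self.start


def reportable(outages, duration_threshold):
    """Keep only outages that last longer than the duration threshold."""
    return [o for o in outages if o.duration > duration_threshold]


if __name__ == "__main__":
    seen = [Outage(0.0, 0.5), Outage(10.0, 25.0)]
    for o in reportable(seen, duration_threshold=1.0):
        print(o, o.duration)
```

Outages at or below the threshold are dropped from reporting here; a real implementation might instead count them separately, as the event filtering slide suggests.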
Architecture
[Figure: COOL architecture, in three layers.
Measurement Methods: Internal Component Outage Detector, fed by event sources (IOS Fault Manager, callbacks, syslog); Remote Component Outage Detector and Customer Equipment Detection Function (ping, SAA); optional baseline CPU usage detection.
Measurement Metrics: Outage Manager, reached through APIs, with a data table structure (Outage Component Table, Event History Table, Event Map Table, Process Map Table, Remote Component Map Table) and an HA and persistent data store (time stamp, temporary event data, crash reason, and outage data held in NVRAM and ATA flash).
Customer Interfaces: Outage Monitor MIB (SNMP polling, SNMP notification) and configuration (CLI, customer authentication).]
Outage Data: AOT and NAF
• Requirements of measurement metrics:
Enable calculation of MTTR, MTBF, availability, and SLA assessment
Ensure measurement efficiency in terms of resources (CPU, memory, and network bandwidth)
• Measurement metrics per object:
AOT: Accumulated Outage Time since measurement started
NAF: Number of Accumulated Failures since measurement started
[Figure: up/down timeline for Router 1 with two system crashes, each lasting 10 time units, giving AOT = 20 and NAF = 2.]
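As a minimal sketch of how the two per-object metrics accumulate, assuming outages are available as hypothetical (down time, up time) pairs:

```python
def accumulate(outages):
    """Return (AOT, NAF) for a list of (down_time, up_time) pairs.

    AOT is the sum of the outage durations; NAF is the number of failures.
    """
    aot = sum(up - down for down, up in outages)
    naf = len(outages)
    return aot, naf


if __name__ == "__main__":
    # Two system crashes of 10 time units each, as in the Router 1 example.
    print(accumulate([(100, 110), (300, 310)]))
```

Because only two counters are kept per object, the router stores very little data and the NMS can derive the other metrics later.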
Outage Data: AOT and NAF
• Object containment model
Router Device
Line Card
Physical Interface
Logical Interface
• Containment independent property
Router Device: AOT = 20; NAF = 2
Interface: AOT = 7; NAF = 1
Service Affecting: AOT = 27; NAF = 3
[Figure: up/down timelines for Router 1 (two system crashes, 10 each) and for Interface 1 on Router 1 (one interface failure, 7).]
Example: MTTR
• Find MTTR for Object i
$MTTR_i = AOT_i / NAF_i = 14/2 = 7$ min
[Figure: up/down timeline for Object i over the measurement interval (T2–T1), with two failures whose times to repair (TTR) are 10 min. and 4 min.]
Example: MTBF and MTTF
• Find MTBF and MTTF for Object i, with (T2–T1) = 1,400,000 min
$MTBF_i = (T_2 - T_1) / NAF_i = 1{,}400{,}000 / 2 = 700{,}000$ min
$MTTF_i = MTBF_i - MTTR_i = (T_2 - T_1 - AOT_i) / NAF_i = 700{,}000 - 7 = 699{,}993$ min
[Figure: up/down timeline for Object i over the measurement interval (T2–T1), with two failures (10 min. and 4 min.) and the TTR, TTF, and TBF spans marked.]
Example: Availability and DPM
• Find availability and DPM for Object i (measurement interval = 1,400,000 min.)
$Availability\ (\%) = \frac{MTBF}{MTBF + MTTR} \times 100 = \frac{700{,}000}{700{,}007} \times 100 = 99.999\%$
$DPM_i = [AOT_i / (T_2 - T_1)] \times 10^6 = 10$ DPM
[Figure: up/down timeline for Object i between T1 and T2, with two failures of 10 min. and 4 min.]
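The MTTR, MTBF, MTTF, availability, and DPM examples all derive from AOT, NAF, and the measurement interval. A small sketch, using the formulas exactly as given on these slides (the function name and the dictionary keys are illustrative only):

```python
def outage_metrics(aot, naf, interval):
    """Derive availability metrics from AOT, NAF and the interval (T2 - T1).

    All times must use the same unit (minutes in the slide example).
    """
    mttr = aot / naf                  # MTTR_i = AOT_i / NAF_i
    mtbf = interval / naf             # MTBF_i = (T2 - T1) / NAF_i
    mttf = mtbf - mttr                # MTTF_i = MTBF_i - MTTR_i
    availability_pct = mtbf / (mtbf + mttr) * 100
    dpm = aot / interval * 1e6        # defects per million
    return {
        "mttr": mttr,
        "mtbf": mtbf,
        "mttf": mttf,
        "availability_pct": availability_pct,
        "dpm": dpm,
    }


if __name__ == "__main__":
    # Object i: outages of 10 min and 4 min over 1,400,000 min.
    for name, value in outage_metrics(14, 2, 1_400_000).items():
        print(f"{name}: {value:.3f}")
```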
Planned Outage Measurement
• Captures both operational CLI commands: “reload” and “forced switchover”
• A simple rule derives an upper bound of the planned outage
If there is no “NVRAM soft crash file”, check the “reboot reason” or “switchover reason”
If the reason is “reload” or “forced switchover”, the outage can be counted toward the upper bound of the planned outage
[Figure: operation-caused outages (send break, reload, forced switchover) contain the planned outages; reload and forced switchover together form the upper bound of the planned outage.]
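The rule above might be expressed as follows. This is a sketch only: the function name, its arguments, and the reason strings are assumptions for illustration and do not reflect the actual COOL data format.

```python
# Reasons that mark an operator-initiated restart (assumed spellings).
PLANNED_REASONS = {"reload", "forced switchover"}


def counts_toward_planned_upper_bound(has_soft_crash_file, reason):
    """Apply the slide's rule to one outage.

    has_soft_crash_file: True if an NVRAM soft crash file exists.
    reason: the recorded reboot reason or switchover reason.
    """
    if has_soft_crash_file:
        return False  # a crash file indicates an unplanned outage
    return reason.strip().lower() in PLANNED_REASONS


if __name__ == "__main__":
    print(counts_toward_planned_upper_bound(False, "reload"))
    print(counts_toward_planned_upper_bound(True, "reload"))
```

The result is an upper bound because an operator may also issue a reload in response to a fault, which is not a planned outage.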
Event Filtering
• Flapping interface detection and filtering:
A faulty interface can keep changing state between up and down
This may cause virtual network disconnection
It may cause an event storm, with hundreds of messages for each flapping event
It may make the object MTBF unreasonably low due to frequent short failures
This unstable condition needs the operator’s attention
COOL detects the flapping status by:
Catching very short outage events (shorter than the duration threshold)
Incrementing the event counter
Declaring flapping status and sending a notification when the counter goes over the flapping threshold (3 events) within the short period (1 sec)
Declaring stable status and sending another notification when the counter falls below the threshold
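A rough sketch of the flapping logic described above, assuming a sliding one-second window over short outage events. The `FlapDetector` class, its defaults, and the 0.5-second duration threshold are hypothetical; the slide does not describe the actual COOL implementation.

```python
from collections import deque


class FlapDetector:
    """Counts short outages in a sliding window and tracks flapping status."""

    def __init__(self, duration_threshold=0.5, flap_threshold=3, window=1.0):
        self.duration_threshold = duration_threshold  # seconds (assumed value)
        self.flap_threshold = flap_threshold          # 3 events, per the slide
        self.window = window                          # 1 sec, per the slide
        self.short_events = deque()                   # end times of short outages
        self.flapping = False

    def observe(self, start, end):
        """Feed one outage; return 'flapping', 'stable', or None."""
        if end - start < self.duration_threshold:
            self.short_events.append(end)
        # Forget short events that fell out of the window.
        while self.short_events and end - self.short_events[0] > self.window:
            self.short_events.popleft()
        count = len(self.short_events)
        if not self.flapping and count > self.flap_threshold:
            self.flapping = True
            return "flapping"   # would trigger the first notification
        if self.flapping and count < self.flap_threshold:
            self.flapping = False
            return "stable"     # would trigger the second notification
        return None


if __name__ == "__main__":
    detector = FlapDetector()
    for start in (0.0, 0.2, 0.4, 0.6, 10.0):
        print(start, detector.observe(start, start + 0.05))
```

Status changes are only evaluated when an event arrives; a production version would presumably also re-check the counter on a timer.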
Data Persistency and Redundancy
• Data persistency
Avoids data loss due to a link outage or a crash of the router itself
• Data redundancy
Continues the outage measurement after a switchover
Retains the outage data even if the RP is physically replaced
[Figure: a router with an active RP and a standby RP, each running COOL. On each RP, outage data in RAM is copied to persistent outage data in NVRAM and flash. Persistent outage data is synchronized between the RPs by periodic updates and event-driven updates.]
Outage Monitor MIB
CISCO-OUTAGE-MONITOR-MIB
iso.org.dod.internet.private.enterprise.cisco.ciscoMgmt.ciscoOutageMIB (1.3.6.1.4.1.9.9.280)
cOutageObjectTable: Object-Type; Object-Index; Object-Status; Object-AOT; Object-NAF
cOutageHistoryTable: Object-Type; Object-Index; Event-Reason-Index; Event-Time; Event-Interval
Object and event descriptions referenced by the tables:
Physical entity object description: entPhysicalTable (ENTITY-MIB)
Interface object description: ifTable (IF-MIB)
Process object description: cpmProcessTable (CISCO-PROCESS-MIB), through the Process MIB Map
Remote object description: Remote Object Map Table
Event description: Event Reason Map Table
Configuration
[Figure: COOL configuration and display.
Config CLI (Cisco IOS configuration): run; add; removal; filtering-enable. Configuration updates COOL and the Customer Equipment Detection Function.
Show CLI (MIB display): show event-table; show object-table. These display the Event Table and the Object Table.]
Enabling COOL
Step 1: Obtain Authorization File
ari#dir
Directory of disk0:/
1 -rw- 19014056 Oct 29 2003 16:09:28 +00:00 gsr-k4p-mz.120-26.S.bin
128057344 bytes total (109051904 bytes free)
ari#copy tftp disk0:
Address or name of remote host []? 88.1.88.9
Source filename []? auth_file
Destination filename [auth_file]?
Accessing tftp://88.1.88.9/auth_file...
Loading auth_file from 88.1.88.9 (via FastEthernet1/2): !
[OK - 705 bytes]
705 bytes copied in 0.532 secs (1325 bytes/sec)
Step 2: Enable COOL
ari#clear cool per
ari#clear cool persist-files
ari#conf t
Enter configuration commands, one per line. End with CNTL/Z.
ari(config)#cool run
ari(config)#^Z
ari#wr mem
Building configuration...
[OK][OK][OK]
COOL
• Pros
Accurate “network availability” for devices, components, and software
Accounts for routing problems
Implementation with low network overhead
Enables correlation between active and passive availability methodologies
• Cons
Only a few systems currently have the COOL feature
Requires implementation in the router configurations of production devices
Availability granularity limited by polling frequency
New Cisco IOS feature
Network Availability Collection Methods
APPLICATION LAYER MEASUREMENT
Application Reachability
Similar to ICMP Reachability
• Method definition: Central workstation or computer configured to send packets that mimic application packets
• How:
Agents on client and server computers collect data: Fire Runner, Ganymede Chariot, Gyra Research, Response Networks, Vital Signs Software, NetScout, custom application queries on customer systems
Special probes installed on user and server subnets send, receive, and collect data: NikSun and NetScout
• Unavailability: Pre-defined QoS definition
Application Reachability
• Pros
Actual application availability can be understood
QoS, by application, can be factored into the availability measurement
• Cons
Depending on scale, potentially high overhead and cost can be expected
DATA COLLECTION FOR ROOT CAUSE ANALYSIS (RCA) OF NETWORK OR DEVICE DOWNTIME
Data Gathering Techniques
Cisco IOS Embedded RMON
• Alarm and event
• History and statistics
• Set thresholds in router configuration
• Configure SNMP trap to be sent when MIB variable rises above and/or falls below a given threshold
• Alleviates need for frequent polling
• Not an availability methodology by itself but can add valuable information and customization to the data collection method
Data Gathering Techniques
Syslog Messages
• Provide information on what the router is doing
• Categorized by feature and severity level
• User can configure Syslog logging levels
• User can configure Syslog messages to be sent as SNMP traps
• Not an availability methodology by itself but can add valuable information and customization to the data collection method
Expression and Event MIB
Expression MIB
• Allows you to create new SNMP objects based upon formulas
• MIB persistence is supported: a MIB’s SNMP data persists across reloads
• Delta and wildcard support allows you to:
Calculate utilization for all interfaces with one expression
Calculate errors as a percentage of traffic
Event MIB
• Allows you to create custom notifications and log them and/or send them as SNMP traps or informs
• MIB persistence is supported: a MIB’s SNMP data persists across reloads
• Can be used to test objects on other devices
• More flexible than RMON events/alarms
RMON is tailored for use with counter objects
Data Gathering Techniques
Embedded Event Manager
• Underlying philosophy: Embed intelligence in routers and switches to enable a scalable and distributed solution, with OPEN interfaces for NMS/EMS leverage of the features
• Mission statement: Provide robust, scalable, powerful, and easy-to-use embedded managers to solve problems such as syslog and event management within Cisco routers and switches
Embedded Event Manager (Cont.)
• Development goal: predictable, consistent, scalable management
Distributed
Independent of central management system
• Control is in the customer’s hands
Customization
• Local programmable actions:
Triggered by specific events
Cisco IOS Embedded Event Manager: Basic Architecture (v1)
[Figure: event detectors feed EEM. The syslog event detector (syslog events), the SNMP event detector (SNMP data), and other event detectors (other events) supply network knowledge to the Embedded Event Manager, which applies EEM policies and triggers actions: notify, switch-over, reload.]
EEM Versions
• EEM Version 1
Allows policies to be defined using the Cisco IOS CLI applet
The following policy actions can be established:
Generate prioritized syslog messages
Generate a CNS event for upstream processing by Cisco CNS devices
Reload the Cisco IOS software
Switch to a secondary processor in a fully redundant hardware configuration
• EEM Version 2
Adds programmable actions using the Tcl subsystem within Cisco IOS
Includes more event detectors and capabilities
EEM Version 2 Architecture
• More event detectors!
• Define policies or “programmable local actions” using Tcl
• Register policy with EEM Server
• Events trigger policy execution
• Tcl extensions for CLI control and defined actions
[Figure: EEM Version 2 architecture. Event publishers (event detectors): Posix Process Manager, IOS Process Watchdog, Syslog Daemon, System Manager, Watchdog Sysmon, HA Redundancy Facility, Syslog, Timer Services, Counters, Interface Counters and Stats, SNMP, and an application-specific event detector. The Embedded Event Manager Server passes events to event subscribers. EEM policies (Tcl shell) subscribe to receive events and implement policy actions. IOS subsystems subscribe to receive application events and publish application events using the application-specific event detector.]
What Does This Mean to the Business?
• Better problem determination
Widely applicable scripts from Cisco engineering and TAC
Automated local action triggered by events
Automated data collection
• Faster problem resolution
Reduces the “next time it happens…please collect”
Better diagnostic data to Cisco engineering
Faster identification and repair
• Less downtime
Reduce susceptibility and Mean Time to Repair (MTTR)
• Better service
Responsiveness
Prevent recurrence
Higher availability
• Not an availability methodology by itself but can add valuable information and customization to the data collection method
INSTILLING AN AVAILABILITY CULTURE
Putting an Availability Program into Practice
• Track network availability
• Identify defects
• Identify root cause and implement fix
• Reduce operating expense by eliminating non-value-added work
How much does an outage cost today?
How much can I save through process and product enhancements?
How Do I Start?
1. What are you using now?
a. Add or modify trouble ticketing analysis
b. Add or improve active monitoring method
2. Process—analyze the data!
a. What caused an outage?
b. Can a root cause be identified and addressed?
3. Implement improvements or fixes
4. Measure the results
5. Back to step 1—are other metrics needed?
If You Have a Network Availability Method
• Use the current method and metric for improvement
Don’t try to change completely
Use incremental improvements
Develop additional methods to gather data as identified
• Concentrate on understanding unavailability causes
All unavailability causes should be classified, at a minimum, under: change, SW, HW, power/facility, or link
• Identify the actions to correct unavailability causes, e.g., network design, customer process change, HW MTBF improvement, etc.
Multilayer Network Design
SA Agent Between Access and Distribution
[Figure: multilayer network design with access, distribution, and core/backbone layers, plus building block additions: server farm, WAN, Internet, and PSTN.]
Multilayer Network Design
SA Agent Between Servers and WAN Users
[Figure: same multilayer network design diagram.]
Multilayer Network Design
COOL for High-End Core Devices
[Figure: same multilayer network design diagram.]
Multilayer Network Design
Trouble Ticketing Methodology
[Figure: same multilayer network design diagram.]
AVAILABILITY MEASUREMENT SUMMARY
Summary
• Availability metric is governed by your business objectives
• Availability measurement’s primary goals are:
To provide an availability baseline (maintain)
To help identify where to improve the network
To monitor and control improvement projects
• Can you identify ‘Where you are now?’ for your network?
• Do you know ‘Where you are going?’ as network-oriented business objectives?
• Do you have a plan to take you there?
Recommended Reading
• Performance and Fault Management
ISBN: 1-57870-180-5
• High Availability Network Fundamentals
ISBN: 1-58713-017-3
• Network Performance Baselining
ISBN: 1-57870-240-2
• The Practical Performance Analyst
ISBN: 0-07-912946-3
Recommended Reading (Cont.)
• The Visual Display of Quantitative Information
by Edward Tufte (ISBN: 0-9613921-0)
• Practical Planning for Network Growth
by John Blommers (ISBN: 0-13-206111-2)
• The Art of Computer Systems Performance Analysis
by Raj Jain (ISBN: 0-471-50336-3)
• Implementing Global Networked Systems Management: Strategies and Solutions
by Raj Ananthanpillai (ISBN: 0-07-001601-1)
• Information Systems in Organizations: Improving Business Processes
by Richard Maddison and Geoffrey Darnton (ISBN: 0-412-62530-X)
• Integrated Management of Networked Systems—Concepts, Architectures, and Their Operational Application
by Hegering, Abeck, Neumair (ISBN: 1558605711)
Appendix A: Acronyms
• AVG—Average
• ATM—Asynchronous Transfer Mode
• DPM—Defects Per Million
• FCAPS—Fault, Config, Acct, Perf, Security
• GE—Gigabit Ethernet
• HA—High Availability
• HDLC—High Level Data Link Control
• HSRP—Hot Standby Routing Protocol
• IPM—Internet Performance Monitor
• IUM—Impacted User Minutes
• MIB—Management Information Base
• MTBF—Mean Time Between Failure
• MTTR—Mean Time to Repair
• RME—Resource Manager Essentials
• RMON—Remote Monitor
• SA Agent—Service Assurance Agent
• SNMP—Simple Network Management Protocol
• SPF—Single Point of Failure; Shortest Path First (routing protocol)
• TCP—Transmission Control Protocol
BACKUP SLIDES
ADDITIONAL RELIABILITY SLIDES
Network Design
What Is Reliability?
• “Reliability” is often used as a general term that refers to the quality of a product
Failure Rate
MTBF (Mean Time Between Failures) or MTTF (Mean Time to Failure)
Availability
Reliability Defined
Reliability:
1. The probability of survival (or no failure) for a stated length of time
2. Or, the fraction of units that will not fail in the stated length of time
A “mission” time must be stated
Annual reliability is the probability of survival for one year
Availability Defined
Availability:
1. The probability that an item (or network, etc.) is operational, and ready-to-go, at any point in time
2. Or, the expected fraction of time it is operational. Annual uptime is the amount of time (in days, hrs., min., etc.) the item is operational in a year
Example: For 98% availability, the annual uptime is 0.98 * 365 days = 357.7 days
MTBF Defined
• MTBF stands for Mean Time Between Failure
• MTTF stands for Mean Time to Failure
This is the average length of time between failures (MTBF) or to a failure (MTTF)
More technically, it is the mean time to go from an operational state to a non-operational state
MTBF is usually used for repairable systems, and MTTF is used for non-repairable systems
How Reliable Is It?
• “MTBF Reliability”, the probability of surviving for a mission time equal to the MTBF:
$R = e^{-(MTBF/MTBF)}$
$R = e^{-1} = 36.8\%$
• MTBF reliability is only 37%; that is, 63% of your HARDWARE fails before the MTBF!
• But remember, failures are still random!
MTTR Defined
• MTTR stands for Mean Time to Repair
or
• MRT (Mean Restore Time)
This is the average length of time it takes to repair an item
More technically, it is the mean time to go from a non-operational state to an operational state
One Method of Calculating Availability
• $Availability = \frac{MTBF}{MTBF + MTTR}$
• What is the availability of a computer with MTBF = 10,000 hrs. and MTTR = 12 hrs?
$A = 10000 \div (10000 + 12) = 99.88\%$
Uptime
• Annual uptime
8,760 hrs/year × (0.9988) = 8,749.5 hrs
• Conversely, annual DOWNtime is
8,760 hrs/year × (1 − 0.9988) = 10.5 hrs
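The availability and annual uptime arithmetic can be checked with a short sketch (function names are illustrative only):

```python
HOURS_PER_YEAR = 8760


def availability(mtbf_hours, mttr_hours):
    """Availability = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_uptime_hours(a):
    return HOURS_PER_YEAR * a


def annual_downtime_hours(a):
    return HOURS_PER_YEAR * (1 - a)


if __name__ == "__main__":
    a = availability(10_000, 12)
    print(f"availability: {a * 100:.2f}%")
    print(f"uptime:   {annual_uptime_hours(a):.1f} hrs/year")
    print(f"downtime: {annual_downtime_hours(a):.1f} hrs/year")
```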
Systems
• Components “In-Series”
[RBD: Component 1 followed by Component 2 on a single path]
• Components “In-Parallel” (Redundant)
[RBD: Component 1 and Component 2 on parallel paths]
RBD: reliability block diagram
In-Series
[Figure: up/down timelines for Part 1, Part 2, and the in-series system. The system is down whenever either part is down.]
In-Parallel
[Figure: up/down timelines for Part 1, Part 2, and the in-parallel system. The system is down only while both parts are down at the same time.]
In-Series MTBF
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component Failure Rate = 1/2500 = 0.0004
System Failure Rate = 0.0004 + 0.0004 = 0.0008
System MTBF = 1/(0.0008) = 1,250 hrs.
In-Series Reliability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component ANNUAL Reliability: $R = e^{-(8760/2500)} = 0.03$
System ANNUAL Reliability: $R = 0.03 \times 0.03 = 0.0009$
In-Series Availability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component Availability: $A = 2500 \div (2500 + 10) = 0.996$
System Availability: $A = 0.996 \times 0.996 = 0.992$
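The three in-series calculations (MTBF, annual reliability, availability) follow one pattern: failure rates add, while reliabilities and availabilities multiply. A sketch with illustrative function names, assuming independent components with constant failure rates:

```python
import math

HOURS_PER_YEAR = 8760


def series_mtbf(mtbfs):
    """System MTBF when every component must work: 1 / sum of failure rates."""
    return 1 / sum(1 / m for m in mtbfs)


def reliability(t_hours, mtbf_hours):
    """Exponential reliability R(t) = exp(-t / MTBF)."""
    return math.exp(-t_hours / mtbf_hours)


def series(values):
    """Series combination of reliabilities or availabilities: their product."""
    return math.prod(values)


if __name__ == "__main__":
    print(series_mtbf([2500, 2500]))                 # about 1,250 hrs
    r = reliability(HOURS_PER_YEAR, 2500)
    print(round(r, 2), round(series([r, r]), 4))     # component, system
    a = 2500 / (2500 + 10)
    print(round(a, 3), round(series([a, a]), 3))     # component, system
```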
In-Parallel MTBF
COMPONENT 1: MTBF = 2,500 hrs.
COMPONENT 2: MTBF = 2,500 hrs.
System MTBF* = 2500 + 2500/2 = 3,750 hrs.
In general*, System MTBF $= \sum_{i=1}^{n} \frac{MTBF}{i}$
*For 1-of-n Redundancy of n Identical Components with NO Repair or Replacement of Failed Components
1-of-4 Example
In general*, System MTBF $= \sum_{i=1}^{n} \frac{MTBF}{i}$
$\sum_{i=1}^{4} \frac{2500}{i} = \frac{2500}{1} + \frac{2500}{2} + \frac{2500}{3} + \frac{2500}{4} = 5{,}208$ hrs.
*For 1-of-n Redundancy of n Identical Components with NO Repair or Replacement of Failed Components
In-Parallel Reliability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component ANNUAL Reliability: $R = e^{-(8760/2500)} = 0.03$
System ANNUAL Reliability: $R = 1 - [(1 - 0.03) \times (1 - 0.03)] = 1 - 0.94 = 0.06$
Each factor (1 − 0.03) is a component unreliability
In-Parallel Availability
COMPONENT 1: MTBF = 2,500 hrs., MTTR = 10 hrs.
COMPONENT 2: MTBF = 2,500 hrs., MTTR = 10 hrs.
Component Availability: $A = 2500 \div (2500 + 10) = 0.996$
System Availability: $A = 1 - [(1 - 0.996) \times (1 - 0.996)] = 1 - 0.000016 = 0.999984$
Each factor (1 − 0.996) is a component unavailability
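The in-parallel slides can be reproduced with the sketch below. Function names are illustrative; the MTBF sum applies only under the stated condition (1-of-n redundancy of identical components with no repair), and the reliability and availability formulas assume independent failures.

```python
import math


def parallel_mtbf(mtbf_hours, n):
    """1-of-n redundancy, identical components, no repair: sum of MTBF / i."""
    return sum(mtbf_hours / i for i in range(1, n + 1))


def parallel(values):
    """Parallel combination of reliabilities or availabilities.

    The system fails only if every component fails, so the result is
    1 minus the product of the unreliabilities (or unavailabilities).
    """
    return 1 - math.prod(1 - v for v in values)


if __name__ == "__main__":
    print(parallel_mtbf(2500, 2))            # 3,750 hrs
    print(round(parallel_mtbf(2500, 4)))     # 5,208 hrs
    print(round(parallel([0.03, 0.03]), 2))  # annual reliability
    print(parallel([0.996, 0.996]))          # availability
```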
Complex Redundancy
m-of-n redundancy (“Pure Active Parallel”)
[Figure: components 1, 2, 3, ..., n in parallel; the system works when at least m of the n components work.]
Examples:
1-of-2
2-of-3
2-of-4
8-of-10
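The slide does not give a formula for m-of-n redundancy. One common textbook approach, sketched here under the assumption of n identical, independent components in pure active parallel, is the binomial sum over "at least m components working":

```python
from math import comb


def m_of_n(m, n, p):
    """Probability that at least m of n identical components are working.

    p is the availability (or reliability) of a single component.
    Assumes independent components, all powered on.
    """
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m, n + 1))


if __name__ == "__main__":
    for m, n in [(1, 2), (2, 3), (2, 4), (8, 10)]:
        print(f"{m}-of-{n}: {m_of_n(m, n, 0.996):.6f}")
```

With m = 1 this reduces to the in-parallel formula, and with m = n to the in-series product.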
More Complex Redundancy
• Pure active parallel
All components are on
• Standby redundant
Backup components are not operating
• Perfect switching
Switch-over is immediate and without fail
• Switchover reliability
The probability of switchover when it is not perfect
• Load sharing
All units are on and workload is distributed
Networks Consist of Series-Parallel
• Combinations of in-series and redundant components
[Figure: series-parallel block diagram with blocks A, B1/B2 (1-of-2 redundancy), C, D1/D2/D3 (2-of-3 redundancy), E, and F.]
Failure Rate
• The number of failures per time:
Failures/hour
Failures/day
Failures/week
Failures/$10^6$ hours
Failures/$10^9$ hours ⇒ called “FITs” (“Failures in Time”)
Approximating MTBF
• 13 units are tested in a lab for 1,000 hours with 2 failures occurring
• Another 4 units were tested for 6,000 hours with 1 failure occurring
• The failed units are repaired (or replaced)
• What is the approximate MTBF?
Approximating MTBF (Cont.)
• $MTBF = \frac{13 \times 1000 + 4 \times 6000}{1 + 2} = \frac{37{,}000}{3} = 12{,}333$ hours
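The approximation divides the total accumulated unit-hours by the total number of failures. A small sketch (the function name and the tuple layout are illustrative):

```python
def approximate_mtbf(tests):
    """Estimate MTBF from lab tests.

    tests: iterable of (units, hours_per_unit, failures) tuples.
    Failed units are assumed to be repaired or replaced during the test.
    """
    total_hours = sum(units * hours for units, hours, _ in tests)
    total_failures = sum(failures for _, _, failures in tests)
    return total_hours / total_failures


if __name__ == "__main__":
    # 13 units for 1,000 hours with 2 failures; 4 units for 6,000 hours with 1.
    print(round(approximate_mtbf([(13, 1000, 2), (4, 6000, 1)])))
```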
Modeling
Distributions
• Normal
• Log-Normal
• Weibull
• Exponential
[Figure: example time-to-failure distributions, plotted as frequency versus time-to-failure, with the MTBF marked on each curve.]
Constant Failure Rate
The Exponential Distribution
• The exponential function:
$f(t) = \lambda e^{-\lambda t}, \quad t > 0$
Failure rate, $\lambda$, IS CONSTANT
$\lambda = 1/MTBF$
• If MTBF = 2,500 hrs., what is the failure rate?
• $\lambda = 1/2500 = 0.0004$ failures/hr.
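A short sketch of the constant failure rate relationships, including the conversion to FITs (failures per 10^9 hours) mentioned on the Failure Rate slide. Function names are illustrative.

```python
import math


def failure_rate(mtbf_hours):
    """Constant failure rate: lambda = 1 / MTBF (failures per hour)."""
    return 1 / mtbf_hours


def to_fits(rate_per_hour):
    """Convert failures per hour to FITs (failures per 10**9 hours)."""
    return rate_per_hour * 1e9


def exponential_pdf(t, rate):
    """Exponential density f(t) = lambda * exp(-lambda * t) for t > 0."""
    return rate * math.exp(-rate * t)


if __name__ == "__main__":
    lam = failure_rate(2500)
    print(lam, round(to_fits(lam)), exponential_pdf(2500, lam))
```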
The “Bathtub” Curve
[Figure: the “bathtub” curve, failure rate versus time, in three phases. Infant mortality: DECREASING failure rate. “Useful life” period: CONSTANT failure rate. Wear-out: INCREASING failure rate.]
The Exponential Reliability Formula
• Commonly used for electronic equipment
• The exponential reliability formula:
$R(t) = e^{-\lambda t}$ or $R(t) = e^{-t/MTBF}$
Calculating Reliability
• A certain Cisco router has an MTBF of 100,000 hrs; what is the annual reliability?
Annual reliability is the reliability for one year or 8,760 hrs
$R = e^{-(8760/100000)} = 91.6\%$
• This says that the probability of no failure in one year is 91.6%; or, 91.6% of all units will survive one year
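The annual reliability figure can be reproduced with the exponential formula. A minimal sketch (the function name is illustrative):

```python
import math

HOURS_PER_YEAR = 8760


def reliability(t_hours, mtbf_hours):
    """Probability of no failure within t_hours: R(t) = exp(-t / MTBF)."""
    return math.exp(-t_hours / mtbf_hours)


if __name__ == "__main__":
    # Router with an MTBF of 100,000 hrs, mission time of one year.
    print(f"{reliability(HOURS_PER_YEAR, 100_000) * 100:.1f}%")
    # Mission time equal to the MTBF: only about 37% of units survive.
    print(f"{reliability(100_000, 100_000) * 100:.1f}%")
```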
ADDITIONAL TROUBLE TICKETING SLIDES
Essential Data Elements
Parameter | Format | Description
Date | dd/mmm/yy | Date Ticket Issued
Ticket | Alphanumeric | Trouble Ticket Number
Start Date | dd/mmm/yy | Date of Fault
Start Time | hh:mm | Time of Fault
Resolution Date | dd/mmm/yy | Date of Resolution
Resolution Time | hh:mm | Time of Resolution
Customers Impacted | Integer | Number of Customers that Lost Service; Number Impacted or Names of Customers Impacted
Problem Description | String | Outline of the Problem
Root Cause | String | HW, SW, Process, Environmental, etc.
Component/Part/SW Version | Alphanumeric | For HW Problems Include Product ID; for SW Include Release Version
Type | Planned/Unplanned | Identify Whether the Event Was Due to Planned Maintenance Activity or an Unplanned Outage
Resolution | String | Description of Action Taken to Fix the Problem
Note: The above is the minimum data set; however, if other information is captured, it should be provided
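As an illustration of how these data elements support an MTTR or downtime calculation, the sketch below models a ticket as a Python dataclass. The class, its field names, and the sample values are hypothetical; only the date and time formats (dd/mmm/yy and hh:mm) come from the table, and the month abbreviations are assumed to be English.

```python
from dataclasses import dataclass
from datetime import datetime

# dd/mmm/yy hh:mm, for example "03/May/04 23:50" (English month names assumed)
STAMP_FORMAT = "%d/%b/%y %H:%M"


@dataclass
class TroubleTicket:
    ticket: str
    start_date: str
    start_time: str
    resolution_date: str
    resolution_time: str
    planned: bool
    root_cause: str
    customers_impacted: int

    def outage_minutes(self) -> float:
        """Minutes between the start of the fault and its resolution."""
        start = datetime.strptime(
            f"{self.start_date} {self.start_time}", STAMP_FORMAT)
        end = datetime.strptime(
            f"{self.resolution_date} {self.resolution_time}", STAMP_FORMAT)
        return (end - start).total_seconds() / 60


if __name__ == "__main__":
    t = TroubleTicket("TT-0001", "03/May/04", "23:50", "04/May/04", "00:20",
                      planned=False, root_cause="HW", customers_impacted=12)
    print(t.outage_minutes())
```

Multiplying the outage minutes by the customers impacted would give impacted user minutes (IUM) for the ticket.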
[Figure: trouble ticket analysis flow.
Trouble Tickets (operational process and procedures): definitions; data accuracy; collection processes
Data Analysis (analyzed trouble ticket data): baseline availability; determine DPM (Defects Per Million) by planned/unplanned, root cause, resolution, and equipment; MTTR
HA Metrics/NAIS Synergy (analysis): network reliability improvement analysis; problem management; fault management; resiliency assessment; change management; performance management; availability management
Flows between the blocks: referral for analysis; referral for process/procedural improvement]
ADDITIONAL SA AGENT SLIDES
SA Agent: How It Works
1. User configures Collectors through Mgmt Application GUI
2. Mgmt Application provisions Source routers with Collectors
3. Source router measures and stores performance data, e.g.:
   Response time
   Availability
4. Source router evaluates SLAs, sends SNMP Traps
5. Source router stores latest data point and 2 hours of aggregated points
6. Application retrieves data from Source routers once an hour
7. Data is written to a database
8. Reports are generated
[Diagram: Management Application and SA Agent, connected via SNMP]
SAA Monitoring IP Core
[Diagram: IP Core with routers R1, R2, R3 and P1, P2, P3, and a Management System]
Monitoring Customer IP Reachability
P1-Pn: Service Assurance Agent ICMP polls to a test point in the IP core
[Diagram: routers P1 to PN, networks Nw1 to NwN, and test points TP1 to TPx]
Service Assurance Agent Features
• Measures Service Level Agreement (SLA) metrics
  Packet loss
  Response time
  Throughput
  Availability
  Jitter
• Evaluates SLAs
• Proactively sends notification of SLA violations
SA Agent Impact on Devices
• Low impact on CPU utilization
• 18k memory per SA agent
• SAA rtr low-memory
Monitored Network Availability Calculation
• Not calculated:
  Already have availability baseline
  Fault type, frequency and downtime may be more useful
  Faults directly measured from management system(s)
Monitored Network Availability Assumptions
• All connections below IP are fixed
• Management systems can be notified of all fixed connection state changes
• All (L2) events impact on IP (L3) service
ADDITIONAL COOL SLIDES
CLIs
Configuration CLI Commands
[no] cool run <cr>
[no] cool interface interface-name(idb) <cr>
[no] cool physical-FRU-entity entity-index(int) <cr>
[no] cool group-interface group-objectID(string) <cr>
[no] cool add-cpu objectID threshold duration <cr>
[no] cool remote-device dest-IP(paddr) obj-descr(string) rate(int) repeat(int) [local-ip(paddr) mode(int)] <cr>
[no] cool if-filter group-objectID(string) <cr>
Display CLI Commands
Router#show cool event-table [<number of entries>]   (displays all entries if not specified)
Router#show cool object-table [<object-type(int)>]   (displays all object types if not specified)
Router#show cool fru-entity
Exec CLI Commands
Router#clear cool event-table
Router#clear cool persistent-files
Measurement Example: Router Device Outage
Reload (Operational), Power Outage, or Device H/W Failure
Type: interface(1), physicalEntity(2), process(3), and remoteObject(4).
Index: the corresponding MIB table index. If the type is physicalEntity(2), this is the index in the ENTITY-MIB.
Status: Up (1), Down (2).
Last-change: time of the last object status change.
AOT: Accumulated Outage Time (sec).
NAF: Number of Accumulated Failures.
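The AOT and NAF counters defined above are enough to derive the usual availability figures for an object. The sketch below is illustrative and rests on assumptions not stated on this slide: availability is taken as 1 - AOT/period, MTTR as AOT/NAF, and MTBF as uptime per failure, (period - AOT)/NAF. The function name and sample values are hypothetical.

```python
from typing import Optional, Tuple


def outage_metrics(
    aot_sec: float, naf: int, period_sec: float
) -> Tuple[float, Optional[float], Optional[float]]:
    """Return (availability, MTTR seconds, MTBF seconds) for one object."""
    availability = 1.0 - aot_sec / period_sec
    if naf == 0:
        # No failures observed: MTTR and MTBF are undefined for this period.
        return availability, None, None
    mttr = aot_sec / naf
    mtbf = (period_sec - aot_sec) / naf
    return availability, mttr, mtbf


# Hypothetical 30-day measurement period with 2 failures totalling 2,592 s.
period = 30 * 24 * 3600
availability, mttr, mtbf = outage_metrics(aot_sec=2592, naf=2, period_sec=period)
print(f"availability={availability:.4%} MTTR={mttr:.0f}s MTBF={mtbf:.0f}s")
```

With these definitions, MTBF / (MTBF + MTTR) equals the computed availability, so the three numbers stay mutually consistent.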
Measurement Example: Cisco IOS S/W Outage
Standby RP in Slot 0 crash using “Address Error (4) Test Crash” (AdEL exception): caused purely by Cisco IOS S/W
Standby RP crash using “Jump to Zero (5) Test Crash” (Bp exception): can be caused by S/W, H/W, or operation
Measurement Example: Linecard Outage
Add a Linecard
Reset the Linecard
Down Event Captured Up Event Captured
AOT and NAF Updated
Measurement Example: Interface Outage
1. Configure to monitor all interfaces whose names include the string "ATM2/0.", except ATM2/0.3
12406-R1202(config)#cool group-interface ATM2/0.
12406-R1202(config)#no cool group-interface ATM2/0.3
2. Object Table
sh cool object 1 | include ATM2/0.
33 1 1054859087 0 0 0 ATM2/0.1
35 1 1054859088 0 0 0 ATM2/0.2
39 1 1054859090 0 0 0 ATM2/0.4
41 1 1054859090 0 0 0 ATM2/0.5
3. Shut ATM2/0 Interface Down: Down Event Captured
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 1 1054859105 18 1 ATM2/0.1
1 35 1 1054859106 18 2 ATM2/0.2
1 39 1 1054859107 17 3 ATM2/0.4
1 41 1 1054859108 18 4 ATM2/0.5
4. No Shut ATM2/0 Interface: Up Event Captured
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
show cool event-table
**** COOL Event Table ****
type index event time-stamp interval hist_id object-name
1 33 0 1054859146 41 1 ATM2/0.1
1 35 0 1054859147 41 2 ATM2/0.2
1 39 0 1054859149 42 3 ATM2/0.4
1 41 0 1054859150 42 4 ATM2/0.5
5. Object Table Shows AOT and NAF
sh cool object 1 | include ATM2/0.
33 1 1054859087 0 41 1 ATM2/0.1
35 1 1054859088 0 41 1 ATM2/0.2
39 1 1054859090 0 42 1 ATM2/0.4
41 1 1054859090 0 42 1 ATM2/0.5
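The relationship between the event table and the AOT/NAF values in the object table can be checked with a short script. This is a sketch under stated assumptions: the column order follows the event-table header on this slide, event code 1 is read as "down" and 0 as "up" (inferred from the shut / no shut sequence here, not from COOL documentation), and the interval on an up event is taken as the length of the outage just ended. The function name is hypothetical.

```python
from collections import defaultdict

# Event-table rows copied from the interface outage example on this slide.
EVENT_ROWS = """\
1 33 1 1054859105 18 1 ATM2/0.1
1 35 1 1054859106 18 2 ATM2/0.2
1 39 1 1054859107 17 3 ATM2/0.4
1 41 1 1054859108 18 4 ATM2/0.5
1 33 0 1054859146 41 1 ATM2/0.1
1 35 0 1054859147 41 2 ATM2/0.2
1 39 0 1054859149 42 3 ATM2/0.4
1 41 0 1054859150 42 4 ATM2/0.5
"""

DOWN, UP = 1, 0  # assumed meaning of the event column for interfaces


def summarize(rows: str) -> dict:
    """Accumulate outage time (AOT) and failure count (NAF) per object."""
    totals = defaultdict(lambda: {"aot": 0, "naf": 0})
    for line in rows.splitlines():
        fields = line.split()
        # Skip blank lines, banners, and the column-header row.
        if len(fields) != 7 or not fields[0].isdigit():
            continue
        _type, _index, event, _stamp, interval, _hist_id, name = fields
        if int(event) == DOWN:
            totals[name]["naf"] += 1          # a new failure began
        elif int(event) == UP:
            totals[name]["aot"] += int(interval)  # outage just ended
    return dict(totals)


summary = summarize(EVENT_ROWS)
for name, counters in sorted(summary.items()):
    print(name, counters["aot"], counters["naf"])
```

For these rows the script yields an AOT of 41 or 42 seconds and an NAF of 1 per subinterface, which agrees with the final object table in the example.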
Measurement Example: Remote Device Outage
3 Remote Devices Are Added
12406-R1202(config)#cool remote-device 1 50.1.1.2 remobj.1 30 2 50.1.1.1 1
12406-R1202(config)#cool remote-device 2 50.1.2.2 remobj.2 30 2 50.1.2.1 1
12406-R1202(config)#cool remote-device 3 50.1.3.2 remobj.3 30 2 50.1.3.1 1
Object Table
sh cool object-table 4 | include remobj
1 1 1054867061 0 0 remobj.1
2 1 1054867063 0 0 remobj.2
3 1 1054867065 0 0 remobj.3
Shut Down the Interface Link Between the Remote Device and Router
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#shut
Down Event Captured
4 2 5 1054867105 42 2 remobj.2
4 1 5 1054867108 47 3 remobj.1
4 3 5 1054867130 65 10 remobj.3
No Shut the Interface Link
12406-R1202(config)#interface ATM2/0
12406-R1202(config-if)#no shut
Up Event Captured
4 1 4 1054867171 63 1 remobj.1
4 3 4 1054867193 63 8 remobj.3
4 2 4 1054867200 95 10 remobj.2
Object Table Shows AOT and NAF
sh cool object-table 4 | include remobj
1 1 1054867061 63 1 remobj.1
2 1 1054867063 63 1 remobj.2
3 1 1054867065 95 1 remobj.3