Lessons learned from an HP Network Automation and Network Node Manager i integrated deployment with...
©2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice
Lessons learned from an HP Network Automation and Network Node Manager i integrated deployment with TelAlert notification in an MPLS environment
Bill P. Fanelli, Principal Architect
Allen Corporation of America
Allen Corporation of America, Inc.
• Headquarters: Fairfax, VA
• Organization
— Training Systems Division
— Integrated Technologies Division
— CyberSecurity Division
— Logistics Services Division
• Regional Offices: Colonial Heights, VA; Ithaca, NY;
Myrtle Beach, SC; The Hague, Netherlands
• Sites in 22 States, with Worldwide Operations
• 250+ employees
• Private Corporation - Small business
• Secret Facilities Clearance
Complete Life-Cycle
Support
Security
Management
Enterprise
Notification
Solutions
Cyber Security Division
Cyber Security, Enterprise Management Services
Agenda
• Integrating NA with NNMi
– Benefits of integration
– Implementation Tips
• Monitoring MPLS with NNMi
– Issues with virtual networks
– How to best match the map to your environment
• Stabilizing staffing using Notification with TelAlert
– Taming the workload with automation
Case Study
• Large Energy company
– Diverse network – includes radio transmission towers and SCADA
devices
– Growth by acquisition – reserves grew by a factor of 50 over 15
years
• Issues in IT
– Assimilation of acquired infrastructure
• NNMi & NA
– MTTR for field outages was 2½ to 3 days
• NA
– Network staff could not grow linearly with company
• Reserves doubled every four months
• NNMi on MPLS
• TelAlert
NA and NNMi Selection Drivers
• See what is running
• Assimilate acquired infrastructure
– Technology
• Cisco
– Process
• Standardize configurations with NA
• Centralize monitoring with NNMi
– People
• Automated notification from NNMi to TelAlert
Let’s Get Started
• Integrating NA with NNMi
• Monitoring MPLS with NNMi
• Stabilizing staffing using Notification with TelAlert
Benefits of Integrated
NA/NNMi Process
• High percentage of outages due to changes
– Coordinate changes
– Ability to roll back changes, both authorized and
unauthorized
• Standardize and Automate
– SNMP community string change
• Add new string
• Confirm all nodes are configured and working
• Remove old string
• Expedite Field Replacements
– Drop ship replacement devices to field location
– Push configuration over the wire
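The three-step community string change above (add, confirm, remove) can be sketched as follows. This is only an in-memory illustration: the `Node` class, its `accepts` check, and the `rollover` function are hypothetical stand-ins, not NA or NNMi APIs. The point it shows is the ordering: the old string is removed only after every node has been confirmed on the new one.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical in-memory stand-in for a managed device."""
    name: str
    communities: set = field(default_factory=set)
    reachable: bool = True

    def accepts(self, community: str) -> bool:
        # Stand-in for a real SNMP poll using the given community string.
        return self.reachable and community in self.communities


def rollover(nodes, old, new):
    """Add `new` everywhere, verify it, and only then remove `old`."""
    # Step 1: add the new string alongside the old one.
    for node in nodes:
        if node.reachable:
            node.communities.add(new)
    # Step 2: confirm every node answers with the new string.
    failed = [n.name for n in nodes if not n.accepts(new)]
    if failed:
        # Leave the old string in place so monitoring keeps working.
        return False, failed
    # Step 3: remove the old string.
    for node in nodes:
        node.communities.discard(old)
    return True, []
```

If any node fails the confirmation step, the sketch returns the list of failing nodes and leaves the old string untouched everywhere.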
Features of NA/NNMi Integration
• GUI integration
– Cross launch with context
– Telnet or SSH access to
devices
– Bring NA diagnostics to
NNMi
• Data integration
– Import NNMi devices into
NA
– Secret Ingredient
• NA must have NNMi Node
UUID to make the match
Linking NA with NNMi
• Run the Connector installer on the NA machine
– Connects to NNMi and installs components there as well
• Dependence on whether NA and NNMi are co-resident
– Some default ports are the same
• Install NNMi first, then NA installer will accommodate
– Separate Connector installers as well
• Learn from us
– Initially co-resident and then moved NA
– Many extra steps involved
• Not worth a "try and see" approach
– Think your way through impact of co-residency
• NNMi has huge memory requirement
Import NNMi Devices to NA
• On NNMi, run nnmimport
• Queries NA for a list of supported OIDs
• Dumps nodes from NNMi database
matching supported OIDs only
• Pushes node information – particularly the
NNMi Node UUID – to NA
• Wanted All Devices from NNMi to NA
– Even Unsupported
Adding Devices from NNMi to NA
• On the NA server, add the OIDs to {NA_DIR}/jre/adjustable_options.rcx
• Format
<array name="drivers/custom_sysoids">
<value> <!-- sys oid --> </value>
<value> <!-- another sys oid --> </value>
<value> <!-- etc. --> </value>
</array>
• For example
<array name="drivers/custom_sysoids">
<value>1.3.6.1.4.1.9.1.479</value>
</array>
• Save and restart NAS
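When many OIDs need adding, the `<array>` block can be generated rather than typed by hand. The sketch below only builds the text fragment in the format shown above; the function name `custom_sysoids_fragment` is an assumption, and pasting the result into adjustable_options.rcx remains a manual step.

```python
from xml.sax.saxutils import escape


def custom_sysoids_fragment(oids):
    """Return the <array> block for a list of sysObjectID strings."""
    lines = ['<array name="drivers/custom_sysoids">']
    for oid in oids:
        # escape() guards against stray XML characters in the input.
        lines.append(f"  <value>{escape(oid)}</value>")
    lines.append("</array>")
    return "\n".join(lines)


print(custom_sysoids_fragment(["1.3.6.1.4.1.9.1.479"]))
```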
Finding Supported OIDs in NA
• telnet or ssh to NA box
• Log in as an NA User
• Run the command list sys oids all
• All OIDs supported by NA will be listed
Finding OIDs in Use in NNMi
• On the NNMi server, run the command
nnmtopodump.ovpl -legacy long -type node
• Pipe the output to find "SNMP OBJECT ID:" (Windows) or grep "SNMP OBJECT ID:" (UNIX) and redirect it to a file, such as OIDs_in_use.out
• For example:
nnmtopodump.ovpl -legacy long -type node | find "SNMP OBJECT ID:" > OIDs_in_use.out
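The filtered file still holds one line per node, so the same OID appears many times. A minimal sketch for reducing it to a unique list follows. It assumes each line contains the label "SNMP OBJECT ID:" followed by a dotted OID, with or without a leading dot; the exact dump layout is not shown in the slides, and the function name `oids_in_use` is hypothetical.

```python
import re

# Assumed line shape: "... SNMP OBJECT ID: .1.3.6.1.4.1.9.1.479"
OID_LINE = re.compile(r"SNMP OBJECT ID:\s*\.?(\d+(?:\.\d+)+)")


def oids_in_use(dump_text):
    """Return the sorted, de-duplicated OIDs found in dump text."""
    found = {m.group(1) for m in OID_LINE.finditer(dump_text)}
    return sorted(found)
```

Reading OIDs_in_use.out and passing its contents to `oids_in_use` would then yield one entry per distinct device type.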
Determine OIDs to Add to NA
• Sort, cut and compare these two lists
• Generate a list of OIDs
– from the NNMi "OIDs in use" list
– that are not in the NA "supported OIDs" list
• Add these to the adjustable_options.rcx file
• The next time nnmimport is run on the NNMi server
– NA will respond that the added OIDs are supported
– therefore nnmimport will include them in the push to NA
• Warning
– nnmimport has the tendency to create duplicate entries in NA
– This is not due to modifying adjustable_options.rcx
– Use nnmimport carefully until you understand the impact on NA
in your environment
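The "sort, cut and compare" step is a set difference. The sketch below assumes both lists have already been reduced to plain dotted OID strings and that NA matches OIDs by exact string; if the NA list contains prefixes or patterns, this comparison would need adjusting. The names `normalize` and `oids_to_add` are hypothetical.

```python
def normalize(oid):
    """Strip whitespace and any leading dot so both lists compare equal."""
    return oid.strip().lstrip(".")


def oids_to_add(in_use, supported):
    """Return OIDs seen in NNMi that NA does not list as supported."""
    supported_set = {normalize(o) for o in supported if o.strip()}
    in_use_set = {normalize(o) for o in in_use if o.strip()}
    return sorted(in_use_set - supported_set)
```

The returned list is what would go into the drivers/custom_sysoids array in adjustable_options.rcx.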
Restart NAS You Say…
Where Are We
• Integrating NA with NNMi
• Monitoring MPLS with NNMi
• Stabilizing staffing using Notification with TelAlert
Monitoring MPLS with NNMi
• Discovery across virtual boundaries is inherently difficult
– Contiguous map
– Downstream suppression
Contiguous Map
• NNMi has Subnet Connection Rules
• NNMi can create Layer 2 Connections for subnets at the
edge of subnetworks that are directly connected via Wide
Area Networks (WANs).
• Define rules to control which subnets and interfaces NNMi
uses to create additional Layer 2 connections.
Small Subnets Rule
• All rules are on by default
Discovery Islands
• Good – not perfect
• Remember that we do not manage large networks by
Maps
– Manage by events
• Topology that NNMi knows about that is represented by
these maps is most important
• Status representation on maps is also important
– Maintain user confidence
• Issue with map status display with MPLS connected sites
– Downstream suppression rule prevents nodes and containers from
representing MPLS outage
Downstream Suppression:
The Situation
• NNMi analyzes the Layer 2 information and determines
when a set of nodes is not connected at Layer 2 as far as
it can discover.
• This applies to MPLS connected sites
• NNMi puts these nodes into NNMi defined node groups
named Island nnnn, where nnnn is a unique number for
each set of layer 2 connected nodes that are not
connected to the NNMi server.
• When an island is isolated by an MPLS failure, all the
nodes in the island are put into a warning or unknown
state.
Downstream Suppression—The Fix
• If a node is added to the Important Nodes node group and
it goes down or becomes isolated, it will be set to critical
status. This overrides the island logic which sets it to
warning or unknown.
• Added filter rules to the Important Nodes node group on
NNMi server as follows:
– Device Filters
• Device = Gateway or Router
– Additional Filters
• Island = not null
• Automatically populates the Important Nodes node group
with the routers in the islands
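The two filter rules combine as a logical AND: a node joins Important Nodes only if it is a gateway or router and it belongs to an island. The sketch below expresses that predicate over a hypothetical `DiscoveredNode` record; the field names are illustrative and do not reflect NNMi's data model or filter syntax.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DiscoveredNode:
    """Hypothetical record; not an NNMi object."""
    name: str
    device_category: str
    island: Optional[str]  # e.g. "Island 1234", or None if not in an island


def is_important(node):
    """Device = Gateway or Router, AND Island = not null."""
    return node.device_category in {"Gateway", "Router"} and node.island is not None


def important_nodes(nodes):
    """Return the names that the filter would place in Important Nodes."""
    return [n.name for n in nodes if is_important(n)]
```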
Downstream Suppression—Outcome
• When MPLS site is isolated
– All routers go critical
• Could be further filtered
• NNMi does produce a Critical Event
– Without adding nodes to Important Nodes
Node Group, the node and containers do not
reflect outage
Home Stretch
• Integrating NA with NNMi
• Monitoring MPLS with NNMi
• Stabilizing staffing using Notification with TelAlert
The Case for Notification
• Text or Text-to-Speech messaging has a lower
barrier to entry since almost everyone now carries
a cell phone
• Normal Hours
– Get someone’s attention at their desk or away from it
• Off Hours
– Staffing for 7 x 24 monitoring is cost prohibitive for
most organizations
• Rule of 13/8
– Need for 7 x 24 monitoring is growing as companies
become more network dependent
Desired Workflow
• Immediate Notification
– Core network team only
– SNMP IFdown Trap
• Root Cause Event
– District and Site where event occurred
– Could be:
• Node Down
• Remote site containing node is unreachable
• Node or Connection Down
• Interface Down
– Typically delayed three minutes
• Reminder on open incidents
– Core network team after one hour
NNMi Actions
• Trigger on Lifecycle States
– Registered, In Progress, Completed, and Closed
– Typically use Registered and Closed
• Large number of parameters for configuring incident
actions plus Custom Incident Attributes
– By pairing Lifecycle States, Message ID stays the same
– Node Down Registered is cleared by Node Down Closed
• Instead of separate Node Up event
• Effect in TelAlert
– When Registered
telalertc -g NetCore -m "Node $sourceNodeName Down" -ticket $id
– When Closed
telalertc -ack -ticket $id
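The pairing works because both commands carry the same ticket value. A small sketch of building the two command lines follows. It uses only the telalertc flags that appear on the slides (-g, -m, -ticket, -ack); the helper names and the sample node name and ticket number are made up for illustration, and nothing here is executed against TelAlert.

```python
import shlex


def registered_cmd(group, node_name, ticket):
    """Command for the Registered state: open a ticketed alert."""
    return ["telalertc", "-g", group,
            "-m", f"Node {node_name} Down",
            "-ticket", str(ticket)]


def closed_cmd(ticket):
    """Command for the Closed state: acknowledge the same ticket."""
    return ["telalertc", "-ack", "-ticket", str(ticket)]


# Hypothetical values standing in for $sourceNodeName and $id.
print(shlex.join(registered_cmd("NetCore", "core-rtr-1", 4711)))
print(shlex.join(closed_cmd(4711)))
```

Keeping the arguments as a list until the last moment avoids quoting mistakes when a node name or message contains spaces.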
Implemented Workflow
• Immediate Notification
– When SNMP Trap Incident enters Registered State
– Send message now to core network group
– telalertc -g NetCore -subject "$severity fault on $sourceNodeName" -m "Fault: $name on $sourceObjectName on node $sourceNodeName at $lastOccurrenceTime"
• Notify Site and District
– When Root Cause Incident enters Registered State
– Send final message to core network, site and district groups
– telalertc -g NetAll -ticket $id -delay 3m -subject "$severity fault on $sourceNodeName" -m "Fault: $name on $sourceObjectName on node $sourceNodeName at $lastOccurrenceTime"
Implemented Workflow
• Reminder on open incidents
– When Root Cause Incident enters Registered State
– telalertc -g NetCore -delay 60m -ticket $id -subject "Reminder message on $sourceNodeName" -m "Reminder message on $sourceNodeName"
• Recovery
– When "Down" Incident enters Closed State
– telalertc -ack -ticket $id
Typical Scenario
• Router loses power
• SNMP IFdown Trap from upstream router
– NNMi sends message to NetCore group for immediate delivery
• Causal engine posts Interface Down Root Cause Incident
– NNMi sends message to NetAll group with three minute delay
– NNMi sends reminder to NetCore group with one hour delay
• Causal engine posts Node or Connection Down Incident
– Interface Down Incident is closed
• NNMi sends -ack to clear Interface Down message and reminder
– NNMi sends message to NetAll group with three minute delay
– NNMi sends reminder to NetCore group with one hour delay
Typical Scenario
• Causal engine posts Node Down Incident
– Interface Down Incident is closed
– NNMi sends -ack to clear Node or Connection Down message
and reminder
– NNMi sends message to NetAll group with three minute
delay
– NNMi sends reminder to NetCore group with one hour
delay
• Three minute delay timer expires
– Node Down message delivered to all groups
• One hour delay timer expires
– Reminder message delivered to NetCore group
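The scenario can be traced with a toy model of delayed, ticketed messages. This is not TelAlert itself: the `Notifier` class is hypothetical, times are in minutes, and the one-minute gaps between successive root cause incidents are assumed for illustration. It shows why only the final Node Down message and its reminder are delivered: each -ack drops the pending messages for the superseded ticket before their delay expires.

```python
class Notifier:
    """Toy model of delayed, ticketed notifications (not TelAlert itself)."""

    def __init__(self):
        self.pending = []    # (due_minute, ticket, group, text)
        self.delivered = []  # (minute, group, text)

    def send(self, now, group, text, ticket=None, delay=0):
        self.pending.append((now + delay, ticket, group, text))

    def ack(self, ticket):
        # Clearing a ticket drops every undelivered message tied to it.
        self.pending = [p for p in self.pending if p[1] != ticket]

    def advance(self, now):
        # Deliver everything whose delay has expired, oldest first.
        due = sorted((p for p in self.pending if p[0] <= now),
                     key=lambda p: p[0])
        self.pending = [p for p in self.pending if p[0] > now]
        for due_minute, _ticket, group, text in due:
            self.delivered.append((due_minute, group, text))


def open_incident(n, now, ticket, name):
    """Root cause incident Registered: 3-minute message, 60-minute reminder."""
    n.send(now, "NetAll", name, ticket=ticket, delay=3)
    n.send(now, "NetCore", f"Reminder: {name}", ticket=ticket, delay=60)


n = Notifier()
n.send(0, "NetCore", "IFdown trap")          # immediate notification
open_incident(n, 0, 1, "Interface Down")
n.advance(0)
n.ack(1)                                     # Interface Down closed
open_incident(n, 1, 2, "Node or Connection Down")
n.advance(1)
n.ack(2)                                     # superseded by Node Down
open_incident(n, 2, 3, "Node Down")
n.advance(62)
for entry in n.delivered:
    print(entry)
```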
Conclusion
• Integrating NA with NNMi
– Consistency of configurations
– Same nodes in both tools
• Monitoring MPLS with NNMi
– Monitor by Incidents
– Map status should reflect real world status
• Stabilizing staffing using Notification with TelAlert
– Demands on staff are growing faster than the staff
headcount
– Automation is the key to survival
Allen Corporation
Allen Corporation of America, Inc.
10400 Eaton Place, Suite 450
Fairfax, VA 22030
(866) HQ - ALLEN (866) 472-5536
www.allencorp.com
Bill [email protected]
571.321.1648 Voice
Questions or Comments?
Thank you for your time
To learn more on this topic, and to connect with your peers after
the conference, visit the HP Software Solutions Community:
www.hp.com/go/swcommunity