©2010 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice

Lessons learned from an HP Network Automation and Network Node Manager i integrated deployment with TelAlert notification in an MPLS environment

Bill P. Fanelli, Principal Architect

Allen Corporation of America

Allen Corporation of America, Inc.

• Headquarters: Fairfax, VA

• Organization

— Training Systems Division

— Integrated Technologies Division

— CyberSecurity Division

— Logistics Services Division

• Regional Offices: Colonial Heights, VA; Ithaca, NY; Myrtle Beach, SC; The Hague, Netherlands

• Sites in 22 States, with Worldwide Operations

• 250+ employees

• Private Corporation - Small business

• Secret Facilities Clearance

Cyber Security Division

Cyber Security, Enterprise Management Services

[Diagram labels: Complete Life-Cycle Support; Security; Management; Enterprise; Notification; Solutions]

Agenda

• Integrating NA with NNMi

– Benefits of integration

– Implementation Tips

• Monitoring MPLS with NNMi

– Issues with virtual networks

– How to best match the map to your environment

• Stabilizing staffing using Notification with TelAlert

– Taming the workload with automation

Case Study

• Large Energy company

– Diverse network – includes radio transmission towers and SCADA devices

– Growth by acquisition – reserves grew by a factor of 50 over 15 years

• Issues in IT

– Assimilation of acquired infrastructure

• NNMi & NA

– MTTR for field outages was 2½ to 3 days

• NA

– Network staff could not grow linearly with company

• Reserves doubled every four months

• NNMi on MPLS

• TelAlert

NA and NNMi Selection Drivers

• See what is running

• Assimilate acquired infrastructure

– Technology

• Cisco

– Process

• Standardize configurations with NA

• Centralize monitoring with NNMi

– People

• Automated notification from NNMi to TelAlert

Let’s Get Started




Benefits of Integrated NA/NNMi Process

• High percentage of outages due to changes

– Coordinate changes

– Ability to roll back changes, both authorized and unauthorized

• Standardize and Automate

– SNMP community string change

• Add new string

• Confirm all nodes are configured and working

• Remove old string

• Expedite Field Replacements

– Drop ship replacement devices to field location

– Push configuration over the wire
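The SNMP community string change listed above follows an add, verify, remove pattern. The sketch below shows only that ordering. The three callables (add_new, verify_new, remove_old) are hypothetical placeholders for whatever NA tasks or scripts perform each step; they are not NA API calls.

```python
from typing import Callable, Iterable, List


def rotate_community_string(
    nodes: Iterable[str],
    add_new: Callable[[str], None],
    verify_new: Callable[[str], bool],
    remove_old: Callable[[str], None],
) -> List[str]:
    """Stage a community string change; return the nodes that failed verification.

    The old string is removed only if every node answers on the new one,
    so a partial rollout never cuts off monitoring.
    """
    nodes = list(nodes)

    # Step 1: add the new string everywhere (old string stays in place).
    for node in nodes:
        add_new(node)

    # Step 2: confirm all nodes are configured and working.
    failed = [node for node in nodes if not verify_new(node)]
    if failed:
        return failed  # stop here; nothing has been removed

    # Step 3: remove the old string.
    for node in nodes:
        remove_old(node)
    return []
```

If the returned list is non-empty, the old string is still active on every node, so the change can be retried or rolled back without losing access.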

Features of NA/NNMi Integration

• GUI integration

– Cross launch with context

– Telnet or SSH access to devices

– Bring NA diagnostics to NNMi

• Data integration

– Import NNMi devices into NA

– Secret Ingredient

• NA must have NNMi Node UUID to make the match

Linking NA with NNMi

• Run the Connector installer on the NA machine

– Connects to NNMi and installs components there as well

• Dependence on whether NA and NNMi are co-resident

– Some default ports are the same

• Install NNMi first, then NA installer will accommodate

– Separate Connector installers as well

• Learn from us

– Initially co-resident and then moved NA

– Many extra steps involved

• Not worth a “try and see” approach

– Think your way through impact of co-residency

• NNMi has huge memory requirement

Import NNMi Devices to NA

• On NNMi, run nnmimport

• Queries NA for a list of supported OIDs

• Dumps nodes from NNMi database matching supported OIDs only

• Pushes node information – particularly the NNMi Node UUID – to NA

• Wanted All Devices from NNMi to NA

– Even Unsupported

Adding Devices from NNMi to NA

• On the NA server, add the OIDs to {NA_DIR}/jre/adjustable_options.rcx

• Format

<array name="drivers/custom_sysoids">
    <value>  </value>
    <value>  </value>
    <value>  </value>
</array>

• For example

<array name="drivers/custom_sysoids">
    <value>1.3.6.1.4.1.9.1.479</value>
</array>

• Save and restart NAS
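When many OIDs have to be added, the array fragment can be generated rather than typed by hand. The following is a minimal sketch that only builds the text shown above. The function name is hypothetical, and the result still has to be pasted into adjustable_options.rcx manually.

```python
import re
from xml.sax.saxutils import escape

# A dotted-decimal OID such as 1.3.6.1.4.1.9.1.479 (no leading dot).
OID_PATTERN = re.compile(r"\d+(\.\d+)+")


def build_custom_sysoids_block(oids):
    """Return the <array> fragment for the given sysObjectIDs as a string."""
    lines = ['<array name="drivers/custom_sysoids">']
    seen = set()
    for raw in oids:
        oid = raw.strip().lstrip(".")
        if not OID_PATTERN.fullmatch(oid):
            raise ValueError(f"not a dotted-decimal OID: {raw!r}")
        if oid in seen:
            continue  # skip duplicates, keep first-seen order
        seen.add(oid)
        lines.append(f"    <value>{escape(oid)}</value>")
    lines.append("</array>")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_custom_sysoids_block(["1.3.6.1.4.1.9.1.479"]))
```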

Finding Supported OIDs in NA

• Telnet or SSH to the NA server

• Log in as an NA user

• Run the command list sys oids all

• All OIDs supported by NA will be listed

Finding OIDs in Use in NNMi

• On the NNMi server, run the command nnmtopodump.ovpl -legacy long -type node

• Pipe the output to find "SNMP OBJECT ID:" (Windows) or grep "SNMP OBJECT ID:" (UNIX or Linux), and redirect it to a file such as OIDs_in_use.out

• For example: nnmtopodump.ovpl -legacy long -type node | find "SNMP OBJECT ID:" > OIDs_in_use.out
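The filtered output still contains one line per node, so the same OID appears many times. The sketch below reduces such lines to a unique set. It assumes each line contains the label SNMP OBJECT ID: followed by a dotted-decimal OID, possibly with a leading dot; the exact layout of nnmtopodump.ovpl output is an assumption here and should be checked against a real dump.

```python
import re

# Assumed line shape: "... SNMP OBJECT ID: .1.3.6.1.4.1.9.1.479"
OID_LINE = re.compile(r"SNMP OBJECT ID:\s*\.?(\d+(?:\.\d+)+)")


def extract_oids(lines):
    """Return the set of distinct OIDs found in the given text lines."""
    found = set()
    for line in lines:
        match = OID_LINE.search(line)
        if match:
            found.add(match.group(1))  # stored without a leading dot
    return found


def extract_oids_from_file(path):
    """Convenience wrapper for a file such as OIDs_in_use.out."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return extract_oids(handle)
```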

Determine OIDs to Add to NA

• Sort, cut and compare these two lists

• Generate a list of OIDs

– from the NNMi “OIDs in use” list

– that are not in the NA “supported OIDs” list

• Add these to the adjustable_options.rcx file

• The next time nnmimport is run on the NNMi server

– NA will respond that the added OIDs are supported

– therefore nnmimport will include them in the push to NA

• Warning

– nnmimport has the tendency to create duplicate entries in NA

– This is not due to modifying adjustable_options.rcx

– Use nnmimport carefully until you understand the impact on NA in your environment
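The sort, cut and compare step above amounts to a set difference. A minimal sketch, assuming both lists have already been reduced to plain dotted-decimal OID strings (the function names are illustrative only):

```python
def normalize(oid):
    """Strip whitespace and any leading dot so both lists compare equal."""
    return oid.strip().lstrip(".")


def oid_sort_key(oid):
    """Sort numerically by arc, so ...9.1.9 comes before ...9.1.10."""
    return tuple(int(part) for part in oid.split("."))


def oids_to_add(in_use, supported):
    """Return OIDs seen in NNMi that NA does not already list as supported."""
    in_use_set = {normalize(o) for o in in_use}
    supported_set = {normalize(o) for o in supported}
    return sorted(in_use_set - supported_set, key=oid_sort_key)


if __name__ == "__main__":
    nnmi = [".1.3.6.1.4.1.9.1.479", "1.3.6.1.4.1.9.1.5"]
    na = ["1.3.6.1.4.1.9.1.5"]
    for oid in oids_to_add(nnmi, na):
        print(oid)
```

The resulting list holds the values to add to the custom_sysoids array.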

Restart NAS You Say…

Where Are We




Monitoring MPLS with NNMi

• Discovery across virtual boundaries is inherently difficult

– Contiguous map

– Downstream suppression

Contiguous Map

• NNMi has Subnet Connection Rules

• NNMi can create Layer 2 Connections for subnets at the edge of subnetworks that are directly connected via Wide Area Networks (WANs).

• Define rules to control which subnets and interfaces NNMi uses to create additional Layer 2 connections.

Small Subnets Rule

• All rules are on by default

Discovery Islands

• Good – not perfect

• Remember that we do not manage large networks by maps

– Manage by events

• The topology that NNMi knows about, as represented by these maps, is most important

• Status representation on maps is also important

– Maintain user confidence

• Issue with map status display for MPLS-connected sites

– Downstream suppression rule prevents nodes and containers from representing an MPLS outage

Downstream Suppression: The Situation

• NNMi analyzes the Layer 2 information and determines when a set of nodes is not connected at Layer 2 as far as it can discover.

• This applies to MPLS-connected sites.

• NNMi puts these nodes into NNMi-defined node groups named Island nnnn, where nnnn is a unique number for each set of Layer 2 connected nodes that are not connected to the NNMi server.

• When an island is isolated by an MPLS failure, all the nodes in the island are put into a warning or unknown state.

Downstream Suppression—The Fix

• If a node is added to the Important Nodes node group and it goes down or becomes isolated, it will be set to critical status. This overrides the island logic, which sets it to warning or unknown.

• Added filter rules to the Important Nodes node group on the NNMi server as follows:

– Device Filters

• Device = Gateway or Router

– Additional Filters

• Island = not null

• Automatically populates the Important Nodes node group with the routers in the islands
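The effect of the two filters can be pictured as a simple predicate over node attributes. The sketch below is a toy model, not NNMi's implementation. The Node class, its field names and the status strings are hypothetical, and it assumes the device filter matches two separate categories, Gateway and Router.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the device filter matches either of these two categories.
ROUTING_CATEGORIES = {"Gateway", "Router"}


@dataclass
class Node:
    name: str
    device_category: str
    island: Optional[str]  # e.g. "Island 0012", or None if not in an island


def is_important(node: Node) -> bool:
    """Mirror the filters: Device = Gateway or Router AND Island = not null."""
    return node.device_category in ROUTING_CATEGORIES and node.island is not None


def status_when_isolated(node: Node) -> str:
    """Important nodes go critical; others keep the island default."""
    return "critical" if is_important(node) else "warning or unknown"


if __name__ == "__main__":
    site = [
        Node("site-rtr-1", "Router", "Island 0012"),
        Node("site-sw-1", "Switch", "Island 0012"),
        Node("core-rtr-1", "Router", None),
    ]
    for node in site:
        print(node.name, is_important(node), status_when_isolated(node))
```

Only the island router ends up in the group, which is why the map shows the isolated site as critical while switches behind it stay suppressed.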

Downstream Suppression—Outcome

• When an MPLS site is isolated

– All routers go critical

• Could be further filtered

• NNMi does produce a Critical Event

– Without adding nodes to the Important Nodes node group, the node and containers do not reflect the outage

Home Stretch




The Case for Notification

• Text or Text-to-Speech messaging has lower barrier to entry since almost everyone now carries a cell phone

• Normal Hours

– Get someone’s attention at their desk or away from it

• Off Hours

– Staffing for 7 x 24 monitoring is cost prohibitive for most organizations

• Rule of 13/8

– Need for 7 x 24 monitoring is growing as companies become more network dependent

Desired Workflow

• Immediate Notification

– Core network team only

– SNMP IFdown Trap

• Root Cause Event

– District and Site where event occurred

– Could be:

• Node Down

• Remote site containing node is unreachable

• Node or Connection Down

• Interface Down

– Typically delayed three minutes

• Reminder on open incidents

– Core network team after one hour

NNMi Actions

• Trigger on Lifecycle States

– Registered, In Progress, Completed, and Closed

– Typically use Registered and Closed

• Large number of parameters for configuring incident actions plus Custom Incident Attributes

– By pairing Lifecycle States, Message ID stays the same

– Node Down Registered is cleared by Node Down Closed

• Instead of separate Node Up event

• Effect in TelAlert

– When Registered

telalertc -g NetCore -m "Node $sourceNodeName Down" -ticket $id

– When Closed

telalertc -ack -ticket $id

Implemented Workflow

• Immediate Notification

– When SNMP Trap Incident enters Registered State

– Send message now to core network group

– telalertc -g NetCore -subject "$severity fault on $sourceNodeName" -m "Fault: $name on $sourceObjectName on node $sourceNodeName at $lastOccurrenceTime"

• Notify Site and District

– When Root Cause Incident enters Registered State

– Send final message to core network, site and district groups

– telalertc -g NetAll -ticket $id -delay 3m -subject "$severity fault on $sourceNodeName" -m "Fault: $name on $sourceObjectName on node $sourceNodeName at $lastOccurrenceTime"

Implemented Workflow

• Reminder on open incidents

– When Root Cause Incident enters Registered State

– telalertc -g NetCore -delay 60m -ticket $id -subject "Reminder message on $sourceNodeName" -m "Reminder message on $sourceNodeName"

• Recovery

– When “Down” Incident enters Closed State

– telalertc -ack -ticket $id
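The mapping from incident type and lifecycle state to telalertc calls can be summarized in one place. The sketch below only builds argument lists from the commands shown on these slides and does not run anything. In the real deployment NNMi substitutes the $ parameters itself; the function names, the kind labels (trap, root_cause) and the dictionary layout are hypothetical.

```python
import shlex


def subject_and_message(inc):
    """Render the subject and body used for fault notifications."""
    subject = f"{inc['severity']} fault on {inc['sourceNodeName']}"
    message = (
        f"Fault: {inc['name']} on {inc['sourceObjectName']} on node "
        f"{inc['sourceNodeName']} at {inc['lastOccurrenceTime']}"
    )
    return subject, message


def actions_for(kind, state, inc):
    """Return the telalertc argument lists for one lifecycle transition."""
    if kind == "root_cause" and state == "Closed":
        # Recovery: acknowledge the ticket to clear message and reminder.
        return [["telalertc", "-ack", "-ticket", inc["id"]]]
    if state != "Registered":
        return []
    subject, message = subject_and_message(inc)
    if kind == "trap":
        # Immediate notification to the core network group only.
        return [["telalertc", "-g", "NetCore", "-subject", subject, "-m", message]]
    if kind == "root_cause":
        reminder = f"Reminder message on {inc['sourceNodeName']}"
        return [
            ["telalertc", "-g", "NetAll", "-ticket", inc["id"], "-delay", "3m",
             "-subject", subject, "-m", message],
            ["telalertc", "-g", "NetCore", "-delay", "60m", "-ticket", inc["id"],
             "-subject", reminder, "-m", reminder],
        ]
    return []


if __name__ == "__main__":
    incident = {
        "id": "12345", "severity": "CRITICAL", "name": "NodeDown",
        "sourceNodeName": "site-rtr-1", "sourceObjectName": "site-rtr-1",
        "lastOccurrenceTime": "2010-06-21 08:15",
    }
    for argv in actions_for("root_cause", "Registered", incident):
        print(shlex.join(argv))
```

Because the notification and the reminder share the same ticket, a single -ack on Closed clears both.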

Typical Scenario

• Router loses power

• SNMP IFdown Trap from upstream router

– NNMi sends message to NetCore group for immediate delivery

• Causal engine posts Interface Down Root Cause Incident

– NNMi sends message to NetAll group with three minute delay

– NNMi sends reminder to NetCore group with one hour delay

• Causal engine posts Node or Connection Down Incident

– Interface Down Incident is closed

• NNMi sends -ack to clear Interface Down message and reminder

– NNMi sends message to NetAll group with three minute delay

– NNMi sends reminder to NetCore group with one hour delay

Typical Scenario

• Causal engine posts Node Down Incident

– Node or Connection Down Incident is closed

• NNMi sends -ack to clear Node or Connection Down message and reminder

– NNMi sends message to NetAll group with three minute delay

– NNMi sends reminder to NetCore group with one hour delay

• Three minute delay timer expires

– Node Down message delivered to all groups

• One hour delay timer expires

– Reminder message delivered to NetCore group
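The scenario above can be replayed with a small model to see why only one delayed message and one reminder reach the groups. This is a toy simulation under stated assumptions: that -ack cancels still-pending messages carrying the same ticket, and that the three incidents arrive 30 seconds apart (the timings are invented for illustration). It is not how TelAlert is implemented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Message:
    due: int                 # delivery time in seconds
    group: str
    text: str
    ticket: Optional[str]


class NotificationModel:
    """Delayed, ticketed messages that an ack can cancel before delivery."""

    def __init__(self) -> None:
        self._queue: List[Message] = []

    def send(self, now, group, text, ticket=None, delay=0):
        self._queue.append(Message(now + delay, group, text, ticket))

    def ack(self, now, ticket):
        # Drop messages for this ticket that have not yet come due.
        self._queue = [m for m in self._queue
                       if m.ticket != ticket or m.due <= now]

    def delivered(self, now) -> List[Tuple[int, str, str]]:
        return sorted((m.due, m.group, m.text)
                      for m in self._queue if m.due <= now)


def replay_router_power_loss() -> NotificationModel:
    model = NotificationModel()
    model.send(0, "NetCore", "IFdown trap")                      # immediate
    previous = None
    for now, name in [(0, "Interface Down"),
                      (30, "Node or Connection Down"),
                      (60, "Node Down")]:
        if previous is not None:
            model.ack(now, previous)                             # incident closed
        model.send(now, "NetAll", name, ticket=name, delay=180)  # 3 minutes
        model.send(now, "NetCore", f"Reminder: {name}", ticket=name, delay=3600)
        previous = name
    return model


if __name__ == "__main__":
    for entry in replay_router_power_loss().delivered(4000):
        print(entry)
```

In this replay the Interface Down and Node or Connection Down messages are cancelled before their three-minute timers expire, so site and district staff receive only the final Node Down message.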

Conclusion


• Integrating NA with NNMi

– Consistency of configurations

– Same nodes in both tools


• Monitoring MPLS with NNMi

– Monitor by Incidents

– Map status should reflect real world status


• Notification with TelAlert

– Demands on staff are growing faster than the staff headcount

– Automation is the key to survival

Allen Corporation

Allen Corporation of America, Inc.

10400 Eaton Place, Suite 450

Fairfax, VA 22030

(866) HQ-ALLEN, (866) 472-5536

www.allencorp.com

Bill Fanelli

571.321.1648 Voice

Questions or Comments?

*******

Thank you for your time

To learn more on this topic, and to connect with your peers after the conference, visit the HP Software Solutions Community:

www.hp.com/go/swcommunity
