Post on 03-Jan-2016
Evaluation of Network Management Systems (NMS)
• Background
• Problem Statement
• Resolution
• Evaluation of NMS solutions
• Recommendation
• Tasks Accomplished
• Tasks Assigned
Rahul Datta, Graduate Student, Vanderbilt University
ISIS, 09/10/06

Background
The Fermi National Accelerator Laboratory (Fermilab) is conducting research in lattice QCD (quantum chromodynamics), for which it operates large clusters of computers. The goal is to understand the strong dynamics of quarks and gluons, which is beyond the reach of the traditional perturbative methods of quantum field theory. A central goal of the groups using the computers is to carry out the calculations required to extract from experiment the fundamental parameters of the Standard Model of particle physics.
Fermilab is focusing on building a Cluster Reliability Subsystem. The LQCD computer cluster will be very large and will need to be available 24 hours a day. The cluster should ensure that resources are used to the best possible extent and should attempt to complete started tasks in the presence of hardware and software failures (be fault resilient).

Background (contd.)
Examples of things that can affect availability and performance include:
• power outages, scheduled and unscheduled
• job failures due to failing or failed hardware
• scheduling jobs on faulty nodes
• decreased performance due to hardware deterioration
• decreased performance due to external influences (e.g. air quality)
• inability to diagnose problems (e.g. hardware, OS, batch tools)

Problem Statement
• To determine and specify the requirements placed on an NMS by LQCD-like systems.
• To survey available network management systems (NMS) and select a limited number capable of meeting the requirements to monitor/manage the computer cluster and the devices contained in that network.
– To measure the performance of the NMS.
– To ascertain the characteristics and features of the NMS.
• To prototype a limited-scale monitoring/adaptation system.
– To monitor the utilization and state of all processors and networks in the system.
• To experiment with it and observe what kinds of plug-ins or modifications can be made to the NMS.
• To consider a system where pluggable components hook into a message distribution system for routing and delivery to other pluggable components.
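The last bullet, pluggable components hooking into a message distribution system, can be illustrated with a minimal publish/subscribe sketch. It is not taken from any of the surveyed NMS packages. The class name `MessageBus`, the topic string, the node name, and the 45 C alarm limit are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Handler = Callable[[dict], None]


class MessageBus:
    """Hypothetical in-process bus: routes a message to every subscriber of its topic."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # A pluggable component registers interest in one topic.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: dict) -> int:
        # Deliver to all current subscribers and report how many received it.
        handlers = list(self._subscribers.get(topic, []))
        for handler in handlers:
            handler(message)
        return len(handlers)


bus = MessageBus()
alarms: List[str] = []
archive: List[dict] = []


def alarm_presenter(message: dict) -> None:
    # Illustrative limit of 45 C; a real limit would come from configuration.
    if message["celsius"] > 45.0:
        alarms.append(f"{message['node']} at {message['celsius']} C")


# Two independent components (an alarm presenter and an archiver) share one topic.
bus.subscribe("node.temperature", alarm_presenter)
bus.subscribe("node.temperature", archive.append)

# A monitor component publishes a reading; the bus handles routing and delivery.
delivered = bus.publish("node.temperature", {"node": "worker01", "celsius": 52.0})
print(delivered, alarms)
```

The monitor never references the alarm presenter or the archiver directly, so components can be added or removed without changing the others.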

Goal Architecture
[Figure: Head Node Functions – Final System. Diagram labels: Coordinator; Archivers; Alarm Presenters; Email Senders; Action Takers; Bookkeeping Database; To/From Subordinates; IB Fabric Monitor; IPMI Monitor; Email Monitor; Dcache Monitor; IP Network Monitor; Phys Attr Monitor; User Proc Monitor; Service Monitor; Disk Monitor; Help Ticket Monitor; Job Scanner; Job Checker; PBS; qstat; Database; Acct Log; Maui.]

Goal Architecture (contd.)
[Figure: Worker Functions – Final System. Diagram labels: Coordinator; Action Takers; Bookkeeping Database; To/From Manager; IB HCA Monitor; IPMI Monitor; IP Network Monitor; Phys Attr Monitor; User Proc Monitor; Service Monitor; Storage Monitor; PBS Monitor; Job Resource Monitor; Job Activity Monitor; Job Class/Profile Monitor; Driver Monitor; CPU State Monitor; Uptime Monitor. Annotations: "Restart services, report success/fail, recycle drivers, reboot machine"; "Activity timing, running, staging, etc."]

Resolution
• Open source, not restricted (distribution, porting, licensing)
• Tools for the user interface
• Kinds of communication available
• Heavyweight or lightweight package (resource requirements: memory, processor, bandwidth)
• Synchronization and triggers, memory check
• Plug-ins available, or modules can be built (e.g. sensor modules)
• Effectors, sensors and monitors
• Documentation

Potential NMS solutions
• OpenNMS
• PIKT
• JFFNMS
• Nagios
• Aware
• Net-Policy
• SYSMON
Note: All the network management systems discussed here are open source.
• Because of the scope of the research done so far, Net-Policy and SYSMON are not discussed in detail here.

OpenNMS (Open Network Management System)
Platforms supported: Linux, Fermi Linux, CentOS, RHEL 3 & 4, Debian Sarge, SuSE, Red Hat Linux, Mandrake, Solaris, Mac OS X (Panther).
Features:
GUI (web-based graphical user interface)
Service polling:
o OpenNMS is a real-time, event-driven system. Events typically come from SNMP traps, but can come from other sources such as syslog. Event handling has no polling interval as such: if a node goes down, the switch generates an SNMP trap immediately, which gives true real-time network monitoring.
o OpenNMS can poll the following services: ICMP, NotesHTTP, DominoHTTP, Citrix, LDAP, SNMP, SNMPv2, and many more.
Network discovery
Availability reporting

OpenNMS (contd.)
SNMP data collection
SNMP trap receiver (over 5000 traps are preconfigured)
Notification via e-mail, pager, XMPP, Growl, or anything that can be run on a command line
Supported communications: alarms, sensors, effectors
Thresholds (based on data collected via SNMP or on response time from a poller)
Well documented
Language written in: Java
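The threshold feature listed above can be pictured with a small sketch. The slide only says that thresholds are based on data collected via SNMP or on poller response time. The class `HighThreshold`, its `trigger` count (consecutive samples required before firing), and its `rearm` level are assumptions made for illustration, not OpenNMS's actual implementation or configuration format.

```python
from typing import List, Optional


class HighThreshold:
    """Hypothetical high threshold with a trigger count and a rearm level."""

    def __init__(self, value: float, rearm: float, trigger: int = 1) -> None:
        self.value = value      # fire when samples reach this level
        self.rearm = rearm      # re-enable once samples fall back to this level
        self.trigger = trigger  # consecutive exceeding samples needed to fire
        self._count = 0
        self._armed = True

    def update(self, sample: float) -> Optional[str]:
        if self._armed:
            if sample >= self.value:
                self._count += 1
                if self._count >= self.trigger:
                    self._armed = False
                    self._count = 0
                    return "exceeded"
            else:
                # A single good sample resets the consecutive count.
                self._count = 0
        elif sample <= self.rearm:
            self._armed = True
            return "rearmed"
        return None


# Response times in milliseconds from an imaginary poller.
threshold = HighThreshold(value=200.0, rearm=150.0, trigger=2)
samples = [120.0, 210.0, 220.0, 230.0, 180.0, 140.0, 205.0]
events: List[Optional[str]] = [threshold.update(s) for s in samples]
print(events)
```

Separating the firing level from the rearm level keeps a value that hovers near the limit from producing a stream of repeated notifications.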

PIKT (Problem Informant/Killer Tool)
Platforms supported: GNU/Linux, Fermi Linux, AIX, FreeBSD, OpenBSD, Digital UNIX
Features:
Lacks a proper GUI
Reporting a problem
Fixing a problem (killing idle user sessions, monitoring user activity, deleting junk files, disk management)
Scanning a log file (log file analysis)
Configuring a system (network configuration)
Auto-configuring a file (automated configuration management)

PIKT (contd.)
Features (contd.):
Job scheduling (centrally directed scheduling daemon, cron alternative)
Monitoring system security (checksum differences, change auditing)
Enhancing the command line (command line macros, remote command execution)
Lacks proper documentation

JFFNMS (Just For Fun Network Management System)
Platforms supported: GNU/Linux, Fermi Linux, AIX, FreeBSD, OpenBSD, Digital UNIX
Features:
Web GUI
Event console: shows events and alarms in the same time-ordered display
Distributed polling
Triggers/actions framework for email and other clients
Map and sub-map support
Completely administered via the web
Sound alerts in the browser
Database abstraction framework
Object oriented
Sensors

JFFNMS (contd.)
Reports:
• Traffic bytes
• Utilization %
• Packets per second, errors per second, error rate
• Round trip time and packet loss (Cisco and SmokePing)
• Drops
• TCP connections: incoming, outgoing, established, delay
• Number of processors, number of users
• Used memory and disks with aggregation
• Processor utilization and load average
• Temperature
Documentation available
Language written in: PHP

Nagios
Platforms supported: Linux, Fermi Linux
Features:
Monitoring of network services (SMTP, POP3, HTTP, etc.)
Ability to define a network host hierarchy, allowing detection of and distinction between hosts that are down and hosts that are unreachable
Notifications via email, pager, or other user-defined method
Ability to define event handlers to be run during service or host events for proactive problem resolution
Ability to acknowledge problems via the web interface
Supported communications:
o Simple plugin design allowing users to develop their own host and service checks

Nagios (contd.)
Supported communications (contd.):
o Monitoring of host resources (processor load, disk and memory usage, running processes, log files, etc.)
o Monitoring of environmental factors such as temperature
Language written in: C
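The simple plugin design can be sketched as follows. A Nagios plugin is an ordinary executable that reports its result through its exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN) and one line of text on standard output, optionally followed by performance data after a `|`. The function name `check_temperature`, the sample reading, and the warning and critical limits are assumptions for illustration. A real check would obtain the reading from a sensor rather than from a literal.

```python
from typing import Optional, Tuple

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_temperature(celsius: Optional[float], warn: float, crit: float) -> Tuple[int, str]:
    """Map a temperature reading to a (status code, status line) pair."""
    if celsius is None:
        return UNKNOWN, "TEMP UNKNOWN - no reading available"
    if celsius >= crit:
        status = CRITICAL
    elif celsius >= warn:
        status = WARNING
    else:
        status = OK
    # Text after '|' is performance data: label=value;warn;crit
    line = f"TEMP {LABELS[status]} - {celsius:.1f} C | temp={celsius:.1f};{warn};{crit}"
    return status, line


# Hypothetical reading of 41.5 C with illustrative limits.
status, line = check_temperature(41.5, warn=40.0, crit=50.0)
print(line)
# A deployed plugin would finish with sys.exit(status) so Nagios sees the state.
```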

Aware
Platforms supported: Linux, Fermi Linux
Features:
Open source implementation allows for a robust code base and customization
Common core engine implements a model of event processing
A "plug-in" style mechanism allows dynamic addition of handlers
Agents are composed of a set of running event handlers
Agents can get their configuration from other agents (e.g., a centrally managed set of agent configurations)
Agents can communicate with other agents using connection-oriented, connectionless and broadcast-based methods

Aware (contd.)
Features (contd.):
Supported communications:
o Sensors: A comprehensive set of sensors that gather relevant information
o Analyzers: Components that process data from the sensors and issue controller commands
o Controllers: Components that change system state (e.g., run programs, change system parameters, control devices)
Documentation available
Language written in: C
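The sensor, analyzer, and controller roles described above can be sketched as one monitoring cycle. This is not Aware's API. The function `run_cycle`, the sensor names, the 95% disk limit, and the `clean_tmp` command are all hypothetical.

```python
from typing import Callable, Dict, List, Optional

Reading = Dict[str, float]


def run_cycle(
    sensors: List[Callable[[], Reading]],
    analyzers: List[Callable[[Reading], Optional[str]]],
    controllers: Dict[str, Callable[[], None]],
) -> List[str]:
    """Gather readings, let analyzers issue commands, and run matching controllers."""
    reading: Reading = {}
    for sensor in sensors:
        reading.update(sensor())  # sensors gather information

    commands: List[str] = []
    for analyze in analyzers:
        command = analyze(reading)  # analyzers turn data into commands
        if command is not None:
            commands.append(command)

    for command in commands:
        controller = controllers.get(command)
        if controller is not None:
            controller()  # controllers change system state
    return commands


def disk_sensor() -> Reading:
    return {"disk_used_pct": 97.0}  # fixed value standing in for a real probe


def load_sensor() -> Reading:
    return {"load_avg": 0.4}


def disk_analyzer(reading: Reading) -> Optional[str]:
    return "clean_tmp" if reading.get("disk_used_pct", 0.0) > 95.0 else None


actions_taken: List[str] = []
controllers = {"clean_tmp": lambda: actions_taken.append("clean_tmp")}

issued = run_cycle([disk_sensor, load_sensor], [disk_analyzer], controllers)
print(issued, actions_taken)
```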

Comparison of the features of the different NMS
Tool Name | GUI and Status Reports | Service Polling, Network Discovery | Alarms, Sensors, Effectors | Memory, Processor, Bandwidth
OpenNMS * * * * * * * * * * * * * * * TBD
PIKT * * * * * * * * TBD
JFFNMS * * * * * * * * * * * TBD
Nagios * * * * * * * * * * * TBD
Aware * * * * * * * * * * * * * TBD

Recommendation
• Explore and experiment with the full features of at least 2 or 3 open source network management systems (NMS) before finalizing an NMS.
• Based on the comparative features, OpenNMS has been chosen.

Tasks Accomplished
• Successful installation of OpenNMS on an offsite Fermi Linux machine at ISIS, Vanderbilt University.

Tasks Assigned
• Exploring the features of OpenNMS, for example:
– Finding, building, and installing a sensor.
– Writing our own sensors, alarms, and effectors.
– Detecting the temperature difference of the hard drive of at least one of the nodes using OpenNMS.
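The temperature-difference task can be sketched independently of any NMS. The code below does not use OpenNMS. The function `temperature_jumps`, the sample readings, and the 3.0 C limit are assumptions for illustration, and the readings would in practice come from a drive sensor.

```python
from typing import Iterable, List, Tuple


def temperature_jumps(
    readings: Iterable[Tuple[str, float]], max_delta: float
) -> List[Tuple[str, float]]:
    """Return (timestamp, delta) for readings that differ from the previous
    reading by more than max_delta degrees Celsius."""
    jumps: List[Tuple[str, float]] = []
    previous = None
    for timestamp, celsius in readings:
        if previous is not None:
            delta = celsius - previous
            if abs(delta) > max_delta:
                jumps.append((timestamp, delta))
        previous = celsius
    return jumps


# Hypothetical hard drive temperatures sampled every five minutes.
samples = [("10:00", 38.0), ("10:05", 38.5), ("10:10", 43.0), ("10:15", 42.5)]
jumps = temperature_jumps(samples, max_delta=3.0)
print(jumps)
```

A sensor module would supply the readings, and an alarm or effector would act on each reported jump.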

Useful Links / URLs
• http://www.openxtra.co.uk/resource-center/open_source_network_management_systems.php
• http://www.opennms.org/index.php/Main_Page
• http://jffnms.sourceforge.net/
• http://www.elegant-software.com/software/aware/doc/html/index.html