Nagios on Tier1 farm. Jonathan Wheeler, RAL Tier1 Fabric Team, 20th June 2008.

Nagios on Tier1 farm

Jonathan Wheeler, RAL Tier1 Fabric Team

20th June 2008

Overview

• What we had before (Sure)
• Introduction to Nagios and how it is configured for the farm
• What might we do next

Sure monitoring - 1

• Consists of a server and clients
• Communication via sysreq command
• Required scripts set up for each client to run checks and report results to server

Sure monitoring - 2

3 main tasks:
a) check host alive
  • active using ping
  • passive accepting heartbeat messages
b) receive alarm messages
c) receive “backup started” and “backup finished” messages

Sure monitoring - 3

Problems:
• configuration not directly under Tier1 control
• requires locally written and locally maintained scripts
• limited view of farm alarms and state
• alarms only visible on server screen

Introduction to Nagios

• highly configurable
• under active development (Nagios 2.11 legacy, Nagios 3.0.2 latest stable)
• active user community (mailing list)
• some commercial offerings
• extensive documentation part of installation
• allows local extensions

Introduction to Nagios – basics - 1

Nagios:
• schedules test commands, for example: is space used in /var filesystem larger than permitted limit
• accepts results as return code (0 – OK, 1 – warning, 2 – critical, 3/-1 – unknown), and a single line message
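The return-code convention above can be sketched as a small plugin. This is a minimal illustration in Python, not one of the Tier1 plugins: the function names, the message wording and the 80%/90% thresholds are all assumptions chosen for the example.

```python
"""Sketch of a Nagios-style plugin: check space used on a filesystem."""
import shutil

# Return codes as listed on the slide.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to (return code, single-line message).

    The thresholds are illustrative assumptions, not Tier1 values.
    """
    if used_percent >= crit:
        return CRITICAL, "DISK CRITICAL - %.0f%% used" % used_percent
    if used_percent >= warn:
        return WARNING, "DISK WARNING - %.0f%% used" % used_percent
    return OK, "DISK OK - %.0f%% used" % used_percent


def check_filesystem(path="/var"):
    """Measure usage of `path`; report UNKNOWN if it cannot be read."""
    try:
        usage = shutil.disk_usage(path)
    except OSError as exc:
        return UNKNOWN, "DISK UNKNOWN - %s" % exc
    return classify(100.0 * usage.used / usage.total)


def main():
    """Print the one-line message and return the code for the exit status."""
    code, message = check_filesystem()
    print(message)
    return code
```

Run as a script, such a plugin would finish with `sys.exit(main())`, so that Nagios receives the message on standard output and the code as the exit status.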

Introduction to Nagios – basics - 2

Nagios (continued):
• displays via Web interface to authorised users
• sends notification via e-mail, SMS, RSS, Morse code, jungle drums etc
• may run an event handler, e.g. if a test fails, then put this batch node offline
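The event-handler idea can be sketched as a decision function. Nagios can pass the service state (OK, WARNING, UNKNOWN, CRITICAL) and state type (SOFT or HARD) to a handler through its `$SERVICESTATE$` and `$SERVICESTATETYPE$` macros. Everything else here is an assumption: the `offline-node` command is hypothetical, and acting only on a HARD CRITICAL state is one possible policy, not necessarily the one used on the farm.

```python
def choose_action(state, state_type, node):
    """Decide what an event handler should do for one state change.

    Returns an argument list for a site-specific command, or None
    when nothing should be done.
    """
    # Act only once Nagios has confirmed the failure (HARD state),
    # so that a single failed attempt does not drain a node.
    if state == "CRITICAL" and state_type == "HARD":
        return ["offline-node", node]  # hypothetical site command
    return None
```

A real handler would pass the returned list to something like `subprocess.run` with the batch system's own command for taking a node offline.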

Introduction to Nagios – networked clients

• Nagios server can use check_nrpe command to run test on networked client
• client must be running nrpe client process to
  – accept and run check requests
  – accept results and return to server
• Nagios server can also use ssh or smtp to perform checks (little experience on Tier1)
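A remote check through check_nrpe might be driven as sketched below. The `-H`, `-c` and `-t` options are the plugin's host, command and timeout options. The plugin path and the command name `check_var` are assumptions: the path varies between installations, and the command must be defined in the client's NRPE configuration.

```python
import subprocess

# Assumption: the plugin location varies between installations.
CHECK_NRPE = "/usr/lib64/nagios/plugins/check_nrpe"


def nrpe_argv(host, command, timeout=10):
    """Build the check_nrpe command line for one remote check.

    `command` must match a command name defined in the client's NRPE
    configuration; "check_var" is a hypothetical example.
    """
    return [CHECK_NRPE, "-H", host, "-c", command, "-t", str(timeout)]


def run_remote_check(host, command):
    """Run the check and return (return code, first line of output)."""
    try:
        done = subprocess.run(nrpe_argv(host, command),
                              capture_output=True, text=True, timeout=30)
    except (OSError, subprocess.TimeoutExpired) as exc:
        # Code 3 is "unknown" in the plugin convention.
        return 3, "UNKNOWN - could not run check_nrpe: %s" % exc
    lines = done.stdout.splitlines()
    return done.returncode, (lines[0] if lines else "")
```

For example, `run_remote_check("ganglia0430", "check_var")` would ask the nrpe process on that client to run whatever `check_var` is configured to do there.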

Single server, many clients

[Diagram: one Nagios server connected to four Nagios clients]

Introduction to Nagios – slave servers

• Running scheduled checks and web server puts heavy load on Nagios server
• Tier1 uses master and slave servers:
  – master keeps all results, runs web server and sends notifications
  – slaves schedule tests, run them and return results to master (using send_nsca command to nsca daemon)
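The hand-off from slave to master can be illustrated by the text that send_nsca reads on standard input: one line per service result, with host, service, return code and message separated by tabs. The sketch below only formats that line; the function name and the example service name are assumptions.

```python
def passive_result_line(host, service, return_code, message):
    """Format one service check result as a send_nsca input line."""
    if return_code not in (0, 1, 2, 3):
        raise ValueError("return code must be 0, 1, 2 or 3")
    # Tabs and newlines separate fields and records, so collapse any
    # whitespace inside the message to single spaces.
    clean = " ".join(message.split())
    return "%s\t%s\t%d\t%s\n" % (host, service, return_code, clean)
```

A slave would then pipe such lines into send_nsca, for example `send_nsca -H <master> -c <config file>`, where the master's address and the configuration file are site-specific.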

Introduction to Nagios – “freshness”

If slave server has crashed:
• master server checks whether tests have been run to schedule (freshness checking)
• if test is stale (test results not returned to schedule), master will run test (force check)
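The freshness rule can be sketched as a comparison between the age of the last result and a threshold. In Nagios itself this is controlled per service by the `check_freshness` and `freshness_threshold` directives. The function below is only a simplified model of the idea, with hypothetical names and a single threshold for all services.

```python
def stale_services(last_result_times, threshold, now):
    """Return the services whose last result is older than `threshold`.

    `last_result_times` maps a service name to the time (seconds since
    the epoch) at which the master last received a result for it.
    """
    return sorted(name for name, last in last_result_times.items()
                  if now - last > threshold)
```

For each service this returns, the master would run the test itself rather than keep waiting for the slave.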

Master and slave servers; many clients

[Diagram: one master server above three slave servers, with the clients below the slaves]

Introduction to Nagios – clearing alarms

If check condition has been corrected and you want to clear alarm before the next scheduled test:
• can force check (from master or slave) by issuing appropriate formatted command to server
• scripts available to do this
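One way to issue such a command is through Nagios's external command file, using the `SCHEDULE_FORCED_SVC_CHECK` external command. The sketch below formats the line and writes it to the command file. The file path is an assumption that depends on the installation, and this is not the Tier1 script itself.

```python
import time

# Assumption: the command file location depends on the installation.
COMMAND_FILE = "/usr/local/nagios/var/rw/nagios.cmd"


def forced_service_check(host, service, when=None):
    """Format a SCHEDULE_FORCED_SVC_CHECK external command line."""
    when = int(time.time()) if when is None else int(when)
    # Format: [timestamp] COMMAND;host;service;check_time
    return "[%d] SCHEDULE_FORCED_SVC_CHECK;%s;%s;%d\n" % (
        when, host, service, when)


def submit(line, command_file=COMMAND_FILE):
    """Write one command line to the Nagios command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)
```

For example, `submit(forced_service_check("shelob", "disk_var"))` would ask for an immediate check of a hypothetical `disk_var` service on host shelob.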

Introduction to Nagios - configuration

In our configuration Nagios knows about:
– hosts
– host groups
– services (for checking)
– contacts and contact groups
– time periods (when tests are valid, when to send contact messages)

Introduction to Nagios - configuration

• Configuration is made simpler by extensive use of templates, for example:
  – define a template for a generic host
  – use it to define many other hosts, only changing parameters that are different (e.g. host name, address, group to which it belongs)
  – can be recursive
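The effect of templates, including recursive use, can be modelled as a lookup that follows each `use` directive back to its parent and lets the child's own values override the inherited ones. This is a simplified model, not Nagios's parser. The `aux-server` template is hypothetical and included only to show a two-level chain, and dropping `name` and `register` from what is inherited is a simplification.

```python
# Directives that a child does not take over from its template
# (a simplification for this sketch).
NOT_INHERITED = {"name", "register"}

DEFINITIONS = {
    "generic-host": {"name": "generic-host", "register": "0",
                     "check_command": "check-host-alive",
                     "max_check_attempts": "10"},
    # Hypothetical intermediate template, to show recursion.
    "aux-server": {"name": "aux-server", "register": "0",
                   "use": "generic-host", "hostgroups": "aux-services"},
    "ganglia0430": {"use": "aux-server", "host_name": "ganglia0430",
                    "address": "130.246.183.173"},
}


def resolve(name, definitions, _seen=()):
    """Return the effective settings of `name` after following `use`."""
    if name in _seen:
        raise ValueError("template loop at %r" % name)
    own = definitions[name]
    merged = {}
    parent = own.get("use")
    if parent is not None:
        inherited = resolve(parent, definitions, _seen + (name,))
        merged.update((key, value) for key, value in inherited.items()
                      if key not in NOT_INHERITED)
    # The object's own settings override anything inherited.
    merged.update((key, value) for key, value in own.items()
                  if key != "use")
    return merged
```

Here `resolve("ganglia0430", DEFINITIONS)` picks up `check_command` and `max_check_attempts` from generic-host and `hostgroups` from the intermediate template, while keeping the host's own name and address.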

# Generic host definition template
define host{
    name                            generic-host    ; name of host template
    notifications_enabled           1               ; Host notifications are enabled
    event_handler_enabled           1               ; Host event handler is enabled
    flap_detection_enabled          1               ; Flap detection is enabled
    process_perf_data               1               ; Process performance data
    retain_status_information       1               ; Retain status information
    retain_nonstatus_information    1               ; Retain non-status information
    register                        0               ; Template definition
    check_command                   check-host-alive
    max_check_attempts              10
    notification_interval           720
    notification_period             24x7
    notification_options            d,u,r
}

define host{
    use             generic-host
    host_name       ganglia0430
    parents         swt-5530-0
    alias           Ganglia Host
    hostgroups      aux-services
    contact_groups  thorne
    address         130.246.183.173
}

define host{
    use             generic-host
    host_name       shelob
    parents         swt-4400-1
    alias           CSF Webserver

……………

Introduction to Nagios - plugins

• Test scripts are known as plugins
• Can be written in any suitable language: shell script, Perl, C, Pascal
• About 60 standard plugins (available by RPM from Dag Wieers’ repository)
• 30+ locally written plugins
• plus 14+ specially written for Castor

Nagios links

• Nagios home page: http://www.nagios.org/

• For locally written plugins: http://cvs.gridpp.rl.ac.uk/viewcvs/viewcvs.cgi/nagios/plugins/

• For GridPP information about Nagios: http://www.gridpp.ac.uk/wiki/Nagios