Staying Sane with Nagios

Staying Sane with Nagios Matt Simmons @standaloneSA [email protected] http://www.standalone-sysadmin.com


From an invited talk I did at PICC-10 (now known as LOPSA-East) about how to manage a Nagios installation without pulling your hair out. In the ensuing years, I've automated more, but still have the same kind of mindset about inheritance and so on.

Transcript of Staying Sane with Nagios

Page 1: Staying Sane with Nagios

Staying Sane with Nagios

Matt Simmons

@standaloneSA

[email protected]

http://www.standalone-sysadmin.com

Page 2: Staying Sane with Nagios

Introduction & Outline

Confessions:
   I am not actually a Nagios Expert
   I do actually LIKE Nagios

Outline:
   Global Sanity
   Small & Medium Shops
   Large Scale Shops
   Add Ons
   Warnings
   Additional Resources

Page 3: Staying Sane with Nagios

I know what you're thinking...

Nagios?

Sane???

Unlikely!!!

Serenity Now!!!

Page 4: Staying Sane with Nagios

Nagios? SANE?!?

Serenity Now!!!

Page 5: Staying Sane with Nagios

Global Sanity

Universal Advice
   Affects installations of all sizes

Documentation
Centralized Authentication
Plugin Development

Page 6: Staying Sane with Nagios

Global Sanity: Documentation

Read the documentation
   Object Definitions: http://nagios.sourceforge.net/docs/3_0/objectdefinitions.html
   Use 3_0 when searching
   Bookmark the good ones
Nagiosbook.org will soon be coming out with 3.x docs
   http://www.nagiosbook.org/

Page 7: Staying Sane with Nagios

Global Sanity: Central Auth

Centralized Authentication
   LDAP / AD with Apache (I use Likewise Open)
   Domain users -> Nagios Contacts
      [email protected]
   Access to CGI interface

Page 8: Staying Sane with Nagios

Global Sanity: Do Not Reinvent the Wheel...

Nagios Exchange: http://exchange.nagios.org/
Pros:
   Nearly 2000 listings
   >1600 plugins
Cons:
   Varying quality and reliability
   Old, unmaintained, code rot, etc.

Page 9: Staying Sane with Nagios

Global Sanity: ...unless you have to

Writing your own Nagios Plugins
   Great guide: http://nagiosplug.sourceforge.net/developer-guidelines.html
   Extended Output
   Huge Community
   Any language you want
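As a minimal sketch of what such a plugin can look like, assuming the usual convention from the developer guidelines (exit code 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN, and one line of output followed by `|` and performance data). The check name `LOAD`, the thresholds, and the function names are made up for illustration.

```python
import os
import sys

# Exit codes from the plugin developer guidelines.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def evaluate(value, warn, crit):
    """Map a measured value onto a plugin exit code (higher is worse)."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def format_output(name, value, warn, crit):
    """Return (exit_code, output_line) in 'TEXT | perfdata' form."""
    code = evaluate(value, warn, crit)
    text = f"{name} {LABELS[code]} - value is {value:.2f}"
    perfdata = f"{name.lower()}={value:.2f};{warn};{crit}"
    return code, f"{text} | {perfdata}"


def main():
    """Check the 1-minute load average (hypothetical thresholds 4.0 / 8.0)."""
    try:
        load1 = os.getloadavg()[0]  # Unix only
    except (OSError, AttributeError):
        print("LOAD UNKNOWN - could not read load average")
        return UNKNOWN
    code, line = format_output("LOAD", load1, warn=4.0, crit=8.0)
    print(line)
    return code

# A real plugin would finish with: sys.exit(main())
```

The exit code is what Nagios acts on; the text is for humans, and the part after `|` is what graphing add-ons usually consume.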

Page 10: Staying Sane with Nagios

Small & Medium Shops

Not exclusively small or medium, just a non-automatic way of doing things

For people who:
   Manually edit / create entries in config files
   Don't use extensive 3rd party management software
   Have a small team of responsible admins
   Don't require large distributed monitoring networks

Page 11: Staying Sane with Nagios

Configuration Sanity

When:
   Creating new configs
   Working with existing configs
   Testing
   Responding to events

Page 12: Staying Sane with Nagios

Syntax Highlighting

This?

Page 13: Staying Sane with Nagios

Syntax Highlighting

Or this?

Page 14: Staying Sane with Nagios

Config File Hierarchy

Default config is stupid.
cfg_dir directive is key
   Loads every *.cfg file in the directory, recursively
Hierarchy should resemble “real life”
Allows for additional “group” security
Use what makes sense to you and document it

Page 15: Staying Sane with Nagios

Config File Hierarchy: Example

Output of “tree -d” on my Nagios objects directory

|-- commands
|-- computers
|   |-- groups
|   |-- linux
|   |   `-- services
|   `-- windows
|-- misc
`-- network
    |-- firewalls
    |-- links
    |-- routers
    `-- switches

Page 16: Staying Sane with Nagios

Regular Expressions

Not all regexes are created equal
use_regexp_matching
   Only when object names contain: * ?
use_true_regexp_matching
   'man regex'
   All object names
   Caution: Unintended consequences
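One unintended consequence is easy to demonstrate. This sketch uses Python's `re` module rather than the regex engine Nagios itself uses, and assumes the same unanchored "match anywhere in the name" behaviour; the host names are hypothetical.

```python
import re

# Hypothetical host names; not taken from the talk.
hosts = ["web1", "web10", "web11", "db1", "oldweb1"]


def regex_match(pattern, names):
    """Return the names an unanchored regular expression would select."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.search(name)]


# Unanchored: matches anywhere in the name, so more hosts than intended.
print(regex_match(r"web1", hosts))
# Anchored at both ends: only the exact name.
print(regex_match(r"^web1$", hosts))
```

Once every object name is treated as a regex, a plain-looking name can quietly select extra hosts unless it is anchored.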

Page 17: Staying Sane with Nagios

Better Object Formatting

This?

Page 18: Staying Sane with Nagios

Better Object Formatting

Or this?

Page 19: Staying Sane with Nagios

Revision Control

CVS/SVN/git(?)
   Simple, maintainable, recoverable
   Self-documenting (if done correctly)

Page 20: Staying Sane with Nagios

(ab)Use Inheritance

Templates
   register = 0
Multiple Inheritance
   Beware the spaghetti code
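To see why deep template chains get hard to trace, here is a rough Python model of how `use` is resolved. It assumes that locally defined directives win over inherited ones and that, with several templates, the first one listed wins. It does not model additive (`+`) values or `null` overrides, and all object names are hypothetical.

```python
# Hypothetical objects keyed by their `name`; templates carry register=0.
OBJECTS = {
    "generic-host": {"register": "0", "max_check_attempts": "3",
                     "notification_interval": "60"},
    "linux-host": {"register": "0", "use": "generic-host",
                   "check_command": "check-host-alive"},
    "quiet-host": {"register": "0", "notification_interval": "0"},
}


def resolve(obj, objects, seen=()):
    """Merge an object's directives with those inherited through `use`.

    Local values win; with several templates, the first one listed wins.
    """
    merged = {k: v for k, v in obj.items() if k != "use"}
    names = [n.strip() for n in obj.get("use", "").split(",") if n.strip()]
    for parent_name in names:
        if parent_name in seen:
            raise ValueError(f"inheritance loop through {parent_name}")
        parent = resolve(objects[parent_name], objects, seen + (parent_name,))
        for key, value in parent.items():
            # `register` describes the template itself and is not inherited.
            if key != "register":
                merged.setdefault(key, value)
    return merged


host = {"host_name": "linux01", "use": "quiet-host,linux-host"}
print(resolve(host, OBJECTS))
```

Here `notification_interval` comes from `quiet-host` even though `generic-host` also sets it two levels down, which is exactly the kind of thing that is tedious to work out by reading config files.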

Page 21: Staying Sane with Nagios

Use Hostgroups

define service{

   service_description SSH Service Check

   check_command check_ssh

   host_name linux01, linux02, linux03, ... linux50

}

Page 22: Staying Sane with Nagios

Use Hostgroups

define hostgroup{
   hostgroup_name linux-servers
}

define host{
   use generic-host
   host_name linux01
   address 192.168.0.10
   hostgroups linux-servers
}

define service{
   service_description SSH service check
   check_command check_ssh
   hostgroup_name linux-servers
}

Page 23: Staying Sane with Nagios

Script / Automate

Automate as much as possible
   New Hosts
   New Services
   Commands
mkhost.sh as a template
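The talk's mkhost.sh is not reproduced in the transcript. As a stand-in, this is a small Python sketch of the same idea: render a host definition from a few arguments so that every new host looks the same. The function name and default template are assumptions.

```python
def render_host(host_name, address, hostgroups, template="generic-host"):
    """Return a Nagios host definition as text."""
    lines = [
        "define host{",
        f"   use {template}",
        f"   host_name {host_name}",
        f"   address {address}",
        f"   hostgroups {','.join(hostgroups)}",
        "}",
    ]
    return "\n".join(lines) + "\n"


print(render_host("linux01", "192.168.0.10", ["linux-servers"]))
```

Write the result to a file under your cfg_dir tree and verify the configuration with `nagios -v` before reloading.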

Page 24: Staying Sane with Nagios

Use alternate contacts file when testing new features

Coworkers are under enough stress as it is
No messy explanations
Use symlinks to point to the “real” contacts file

Page 25: Staying Sane with Nagios

Plugin Sanity

Thoughts about writing, configuring, and using Nagios plugins

Page 26: Staying Sane with Nagios

SNMP

Use it whenever possible. Really.

Page 27: Staying Sane with Nagios

NRPE vs check_by_ssh

NRPE
   Nagios Remote Plugin Executor
   Skip it when possible
   Use SNMP

Page 28: Staying Sane with Nagios

When checking disk usage

Do not specify the partitions to check
Instead, specify the partitions to NOT check
   Too easy to forget to add new partitions
If possible, use a plugin that produces statistics for graphing usage trends
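A sketch of the exclude-list approach in Python, assuming a Linux host with /proc/mounts. The excluded filesystem types and paths are examples only. Anything not explicitly excluded gets checked, so a newly added partition is picked up automatically.

```python
import shutil

# Hypothetical exclusions: pseudo-filesystems and one read-only mount.
EXCLUDED_FSTYPES = {"proc", "sysfs", "tmpfs", "devtmpfs", "devpts"}
EXCLUDED_PATHS = {"/mnt/cdrom"}


def read_mounts(path="/proc/mounts"):
    """Yield (mount_point, fstype) pairs from a Linux mounts table."""
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            if len(fields) >= 3:
                yield fields[1], fields[2]


def mounts_to_check(mounts, excluded_fstypes=EXCLUDED_FSTYPES,
                    excluded_paths=EXCLUDED_PATHS):
    """Keep every mount that is not explicitly excluded."""
    return [mp for mp, fstype in mounts
            if fstype not in excluded_fstypes and mp not in excluded_paths]


def usage_report(mount_points):
    """Return 'mount=NN%' perfdata tokens for graphing usage trends."""
    tokens = []
    for mp in mount_points:
        usage = shutil.disk_usage(mp)
        if usage.total:
            tokens.append(f"{mp}={100 * usage.used // usage.total}%")
    return " ".join(tokens)
```

On a Linux box, `print(usage_report(mounts_to_check(read_mounts())))` prints one usage figure per remaining mount point.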

Page 29: Staying Sane with Nagios

Notification Sanity

Notifications suck. Here are some ways to make them not suck as much.

Page 30: Staying Sane with Nagios

Alternate Communication Method

When the network is down, email is down too
Have a non-email contact method
   SMS, cell modem, smoke signals
Test it occasionally

Page 31: Staying Sane with Nagios

Use parents

Establish a path FROM THE NAGIOS SERVER
Failure will trigger “unreachable” states
   “u” notification flag
Only useful for non-local-subnet hosts, typically
   If the local switch dies, alerts don't go out anyway
      Typically
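The DOWN versus UNREACHABLE distinction can be modelled in a few lines. This is a simplified Python sketch, not Nagios's actual reachability logic. It assumes a failed host counts as UNREACHABLE when it has parents and none of them is UP, and that the parent graph has no loops. The topology is hypothetical.

```python
# Hypothetical topology: child -> parents, as seen from the Nagios server.
PARENTS = {
    "core-switch": [],
    "router1": ["core-switch"],
    "remote-switch": ["router1"],
    "web01": ["remote-switch"],
}


def classify(failed_checks, parents):
    """Label hosts whose checks failed as DOWN or UNREACHABLE.

    A failed host is UNREACHABLE when it has parents and every one of them
    is itself DOWN or UNREACHABLE; otherwise it is DOWN.
    """
    states = {}

    def state_of(host):
        if host not in failed_checks:
            return "UP"
        if host not in states:
            ups = [p for p in parents.get(host, []) if state_of(p) == "UP"]
            blocked = parents.get(host) and not ups
            states[host] = "UNREACHABLE" if blocked else "DOWN"
        return states[host]

    return {host: state_of(host) for host in failed_checks}


print(classify({"router1", "remote-switch", "web01"}, PARENTS))
```

Only router1 comes out as DOWN. If the contacts leave `u` out of their host notification options, that is the only alert anyone receives.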

Page 32: Staying Sane with Nagios

Use Dependencies

Available for both hosts and services
   The disks didn't blow up, SNMP crashed
   What do you mean, the website is unavailable when the database crashes
Dependencies != parents
   Parents establish a line between the host and Nagios
   Dependencies establish logical object relationships

Page 33: Staying Sane with Nagios

Notifications are Commands

Use Them
   Execute what you need, when you need, where you need through extra-nagios scripts
Your imagination is the limit
   Electrical relays?
   Flashing lights?
   HALON release?
      Please don't.

Page 34: Staying Sane with Nagios

Use Passive Checks (when necessary / appropriate)

For “normal” passive checks, specify freshness checks
Useful for SNMP traps
   Combine with snmptrapd
Distributed Monitoring
   Use for capacity reasons
   Physical separation calls for separate Nagios installs (in my opinion)
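For reference, a passive service result is one line written to the Nagios external command file. A minimal Python sketch follows; the helper names are made up, and the command file location depends on your install (something like `/usr/local/nagios/var/rw/nagios.cmd` is an assumption, not a given).

```python
import time


def passive_result(host, service, return_code, output, now=None):
    """Format a PROCESS_SERVICE_CHECK_RESULT external command line."""
    timestamp = int(time.time() if now is None else now)
    return (f"[{timestamp}] PROCESS_SERVICE_CHECK_RESULT;"
            f"{host};{service};{return_code};{output}\n")


def submit(line, command_file):
    """Write the command to the Nagios external command file (a named pipe)."""
    with open(command_file, "w") as pipe:
        pipe.write(line)


print(passive_result("linux01", "Backup", 0, "OK - backup finished"), end="")
```

The service on the Nagios side should set `check_freshness` and a `freshness_threshold`, so that a sender which silently stops reporting shows up as a problem instead of leaving a stale OK.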

Page 35: Staying Sane with Nagios

Macros GOOD

60 bajillion available: http://nagios.sourceforge.net/docs/3_0/macrolist.html
On Demand Macros
   Specify “remote” macros from other hosts
   $HOSTMACRO:SOMEHOST$
Custom Variable Macros
   _MACADDRESS 00:01:02:03:04:05
   $_HOSTMACADDRESS$
Available as environment variables in scripts
   $NAGIOS_MACRONAME
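A short Python sketch of a notification script reading macros from its environment. It assumes `enable_environment_macros` is turned on and that the custom `_MACADDRESS` host variable is exported as `NAGIOS__HOSTMACADDRESS`. Check the variable names against your own install; the function name is hypothetical.

```python
import os


def build_message(env=os.environ):
    """Compose a short notification text from NAGIOS_* environment macros."""
    host = env.get("NAGIOS_HOSTNAME", "unknown-host")
    state = env.get("NAGIOS_HOSTSTATE", "UNKNOWN")
    output = env.get("NAGIOS_HOSTOUTPUT", "")
    # Custom variable macro: _MACADDRESS on the host becomes _HOSTMACADDRESS.
    mac = env.get("NAGIOS__HOSTMACADDRESS")
    message = f"{host} is {state}"
    if output:
        message += f": {output}"
    if mac:
        message += f" (MAC {mac})"
    return message


print(build_message())
```

Reading from the environment keeps the command definition short, since the script no longer needs a long list of macros passed as arguments.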

Page 36: Staying Sane with Nagios

Use Flap Detection

Or not. Who wants a charged cellphone battery?

Measures state changes:
   Weighted measure of the last 21 checks
   More recent changes count higher
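A rough Python sketch of that weighted measure. It assumes a linear weighting in which the oldest possible state change counts 0.75 and the newest 1.25. Those two constants and the function name are assumptions for illustration, not taken from the slides.

```python
def percent_state_change(states, low=0.75, high=1.25):
    """Weighted percent state change over a history of check states.

    `states` is ordered oldest to newest. Each transition between
    consecutive entries is weighted linearly from `low` (oldest) to
    `high` (newest), so recent changes count more.
    """
    slots = len(states) - 1  # number of possible transitions
    if slots < 1:
        return 0.0
    weighted = 0.0
    for i in range(1, len(states)):
        if states[i] != states[i - 1]:
            position = (i - 1) / (slots - 1) if slots > 1 else 1.0
            weighted += low + position * (high - low)
    return 100.0 * weighted / slots


# 21 checks give 20 possible state changes.
recent_flip = [0] * 20 + [1]
old_flip = [1] + [0] * 20
print(percent_state_change(recent_flip), percent_state_change(old_flip))
```

The result is what gets compared against the high and low flap thresholds, so a single recent flip moves the needle more than an old one.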

Page 37: Staying Sane with Nagios

Large Shops

Too many nodes to easily configure by hand, or too many nodes to deal with using one server

Scaling Nagios
Centralized Management
Web Configurators

Page 38: Staying Sane with Nagios

Scaling Nagios

use_large_installation_tweaks
   No summary macros, memory handling is different, and processes fork() less
Distributed monitoring
   Assign groups of hosts to one Nagios server (reporting via NSCA / passive checks)
Check tuning docs: http://nagios.sourceforge.net/docs/3_0/tuning.html

Page 39: Staying Sane with Nagios

Centralized Management

Puppet / chef / cfengine / whatever
   Distribute nagios user's key if necessary
   Install nagios agents (NSCA / NRPE)
   Automate Configuration Build
Puppet's built-in Nagios types sound convenient...sort of

Page 40: Staying Sane with Nagios

Nagios Web Configuration

Dozens, if not hundreds
I don't know of a great one.
May be worth building or finding one that matches your inventory system
   Don't double-up on data if you don't have to

Page 41: Staying Sane with Nagios

Malproductive Practices

Overreliance on Event Handlers
   Please don't do anything terribly important.
   Edge cases are scary.
Overabuse of inheritance
   Spaghetti code
   Hard to trace
Overcomplication
   Simple is nearly always better

Page 42: Staying Sane with Nagios

Learn More

Mailing List
   Nagios Users: https://lists.sourceforge.net/lists/listinfo/nagios-users
LinkedIn
   Nagios Users: http://www.linkedin.com/groupAnswers?viewQuestions=&gid=131532&forumID=3&sik=1272591931152