Lessons in Nagios learnt from developing Opsview

Ton Voon

Altinity Limited

September 2008

Copyright Ton Voon. Released under Creative Commons, Attribution-Noncommercial

Classified: Top secrets of Nagios in Opsview

“Opsview”?

Our monitoring solution

Database back end

Web front end

Configuration and status

Distributed

Open source

Why?

Obligation – GPLv2

http://trac.opsview.org/browser/trunk/opsview-base

Moral duty

Business benefit:

Support moves to core projects

Easier for us to upgrade base code

Advertise ourselves as experts

Distributed environments

Batch uploading to master

Nagios documentation suggests calling ocsp command after every result

send_nsca

Batch up requests and send all at once

Implemented as service_perfdata_file_processing_command

nsca --single
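A minimal sketch of the batching idea, assuming results are collected first and then rendered as a single send_nsca input stream. The `CheckResult` type and `format_for_send_nsca` name are hypothetical; only the tab-delimited record layout comes from send_nsca itself.

```python
from typing import Iterable, NamedTuple


class CheckResult(NamedTuple):
    host: str
    service: str      # empty string means a host check result
    return_code: int
    output: str


def format_for_send_nsca(results: Iterable[CheckResult]) -> str:
    """Render many results as one tab-delimited payload for send_nsca's stdin."""
    lines = []
    for r in results:
        # Tabs or newlines inside plugin output would break the record framing.
        output = r.output.replace("\t", " ").replace("\n", " ")
        if r.service:
            fields = [r.host, r.service, str(r.return_code), output]
        else:
            fields = [r.host, str(r.return_code), output]
        lines.append("\t".join(fields))
    return "".join(line + "\n" for line in lines)


batch = format_for_send_nsca([
    CheckResult("web1", "HTTP", 0, "OK"),
    CheckResult("web1", "", 1, "DOWN\nno route"),
])
```

The whole payload could then be piped to one send_nsca invocation instead of starting one process per result.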

Aggregated writes

Problem: if the cmd file is not available, NSCA writes to a dump file. But when the cmd file comes back, NSCA does not switch back to it

Affects Nagios reloads

Solution: when the dump file is in use, keep checking for the cmd file

In NSCA CVS HEAD, but not released yet
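The fix can be sketched as follows. This is an illustration in Python, not the NSCA patch itself (NSCA is written in C), and `CommandWriter` with its paths is a hypothetical name. The point is that the existence check happens on every write rather than once.

```python
import os


class CommandWriter:
    """Write to the command file when present, else to a dump file.

    The existence check is repeated on every write, so output switches back
    to the command file as soon as it reappears (e.g. after a Nagios reload).
    """

    def __init__(self, cmd_path: str, dump_path: str) -> None:
        self.cmd_path = cmd_path
        self.dump_path = dump_path

    def write(self, line: str) -> str:
        # Assumption: regular files are used here; the real Nagios command
        # file is a FIFO, where open() blocks until a reader is attached.
        target = self.cmd_path if os.path.exists(self.cmd_path) else self.dump_path
        with open(target, "a", encoding="utf-8") as handle:
            handle.write(line + "\n")
        return target
```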

Distributing CGI commands

Problem: Submitting a command via CGI on master should go down to slaves

Solution: broker module altinity_distributed_commands. For a selected list of external commands, writes to a cache file
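A rough sketch of the filtering step, assuming the standard Nagios external command line format. The real altinity_distributed_commands is a broker module; the command selection and function names below are hypothetical, since the slides do not list which commands are distributed.

```python
import re
from typing import Optional

# Hypothetical selection; the slides do not say which commands are chosen.
DISTRIBUTED_COMMANDS = {"ACKNOWLEDGE_SVC_PROBLEM", "SCHEDULE_HOST_DOWNTIME"}

# Nagios external command lines look like "[<unix time>] NAME;arg1;arg2".
_COMMAND_RE = re.compile(r"^\[(\d+)\] ([A-Z_]+)(?:;(.*))?$")


def command_name(line: str) -> Optional[str]:
    """Return the command name, or None if the line is not a command."""
    match = _COMMAND_RE.match(line.strip())
    return match.group(2) if match else None


def cache_if_distributed(line: str, cache_path: str) -> bool:
    """Append the command to the cache file if it is in the selected list."""
    if command_name(line) not in DISTRIBUTED_COMMANDS:
        return False
    with open(cache_path, "a", encoding="utf-8") as cache:
        cache.write(line.strip() + "\n")
    return True
```

A separate process could then ship the cache file to each slave and replay it there.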

Distributable API commands

Some commands cannot be sent to slaves

DEL_HOST_SVC_DOWNTIME;hostname

Freshness calculations

Arbitrary 15 seconds

We set to 30 minutes

In Nagios 3: additional_freshness_latency

Included libtap tests
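To see why the size of the allowance matters, here is a deliberately simplified model of a freshness test. It is not Nagios's exact calculation, and `is_stale` and its parameters are hypothetical names; it only shows how a larger allowance tolerates results that arrive late from a slave.

```python
def is_stale(last_result: float, now: float, check_interval: float,
             additional_latency: float) -> bool:
    """Return True when a passive result is older than interval + allowance."""
    threshold = check_interval + additional_latency
    return (now - last_result) > threshold


# A result that is 6 minutes old, for a check expected every 5 minutes.
with_15_seconds = is_stale(0, 360, 300, 15)
with_30_minutes = is_stale(0, 360, 300, 30 * 60)
```

With the 15 second allowance the result is already treated as stale; with 30 minutes it is still considered fresh.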

NRPE

Centrally managed agents

Set allowed_hosts to blank

Commands defined to pass arguments from check_nrpe

command[check_disk]=/usr/local/nagios/libexec/check_disk $ARG1$
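The effect of the `$ARG1$` placeholder can be illustrated with a small sketch. NRPE does this substitution itself (and only when built with command-argument support and with dont_blame_nrpe enabled); `expand_args` is a hypothetical helper that mimics the behaviour for illustration.

```python
import re
from typing import Sequence


def expand_args(template: str, args: Sequence[str]) -> str:
    """Replace $ARG1$, $ARG2$, ... with the supplied values; missing ones become empty."""
    def substitute(match: "re.Match[str]") -> str:
        index = int(match.group(1)) - 1
        return args[index] if 0 <= index < len(args) else ""
    return re.sub(r"\$ARG(\d+)\$", substitute, template)


command_line = expand_args(
    "/usr/local/nagios/libexec/check_disk $ARG1$",
    ["-w 10% -c 5% -p /"],
)
```

Because the thresholds travel with the request from check_nrpe, the agent configuration stays identical on every host and all tuning happens centrally.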

Increased output

Backwards compatible

Increases to whatever the nrpe server is compiled with

[Diagram, shown twice: check_nrpe on the Nagios server, nrpe on the remote host]

Nagios

Slicing services in CGI

Problem: Only want to show services specific to a user, not all services on the host

Solution: Removed authentication that allows host

Initial states

Problem: Services and hosts go into a PENDING state. But this affects reporting and state changes because there has never been a result received. Also, there’s no entry in nagios_hoststatus/nagios_servicestatus

Solution: Create a broker module to send an UP/OK for all hosts/services
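The slide's solution is a broker module. As a plainly different but related illustration, the same effect can be approximated from outside Nagios by submitting passive results through external commands. The function name and the output text below are hypothetical; the two command names are standard Nagios external commands.

```python
import time
from typing import Iterable, List, Optional, Tuple


def initial_state_commands(
    hosts: Iterable[str],
    services: Iterable[Tuple[str, str]],
    now: Optional[int] = None,
) -> List[str]:
    """Build passive-result commands marking hosts UP and services OK."""
    stamp = int(time.time()) if now is None else now
    commands = [
        f"[{stamp}] PROCESS_HOST_CHECK_RESULT;{host};0;Assumed UP (no result yet)"
        for host in hosts
    ]
    commands += [
        f"[{stamp}] PROCESS_SERVICE_CHECK_RESULT;{host};{service};0;Assumed OK (no result yet)"
        for host, service in services
    ]
    return commands


commands = initial_state_commands(["web1"], [("web1", "HTTP")], now=1220000000)
```

Each line would be written to the Nagios command file, which gives every object a first result and a row in the status tables.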

Changing command based on timeperiod

Could do via Nagios API

But requires external process to submit

Can now do via configuration:

define service {
    ...
    check_timeperiod_command workhours,command_name
}

Notification logic performance tuning

Problem: 100% CPU!

Solution: strace on Nagios showed lots of time spent in notification logic, calculating macros

NDOutils

Case insensitive object names

nagios_objects is the key table

But name1, name2 are case insensitive

HostnameA and hostnamea are the same host in NDO

But not in Nagios

ALTER TABLE nagios_objects MODIFY name1 varchar(128) COLLATE latin1_bin
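Before changing the collation it may help to know which existing names already clash. A minimal sketch, assuming the object names have been read into a list; `case_collisions` is a hypothetical helper and lower-casing is only an approximation of a case-insensitive collation.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def case_collisions(names: Iterable[str]) -> Dict[str, List[str]]:
    """Group names that differ only by case, as a case-insensitive collation sees them."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name in names:
        groups[name.lower()].append(name)
    return {key: found for key, found in groups.items() if len(set(found)) > 1}


clashes = case_collisions(["HostnameA", "hostnamea", "db1"])
```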

Indexing

MySQL can only use one index per table per query

Column order matters in multi-column indexes

(instance_id, service_object_id, start_time, start_time_usec)

(start_time, instance_id, service_object_id, start_time_usec)

Use EXPLAIN to work out how MySQL will tackle your query
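The effect of column order can be tried out without a MySQL server. The sketch below uses SQLite from the Python standard library as a stand-in, so the planner and its output differ from MySQL's EXPLAIN; the table name `demo_checks` and the `output` column are hypothetical, while the two index orderings are the ones from the slide.

```python
import sqlite3
from typing import Sequence


def query_plan(index_columns: Sequence[str], where: str) -> str:
    """Create a table with one multi-column index and return the plan for a query."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE demo_checks ("
        "instance_id INTEGER, service_object_id INTEGER, "
        "start_time INTEGER, start_time_usec INTEGER, output TEXT)"
    )
    conn.execute(
        f"CREATE INDEX idx_demo ON demo_checks ({', '.join(index_columns)})"
    )
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM demo_checks WHERE {where}"
    ).fetchall()
    conn.close()
    # The last column of each row holds the human-readable plan step.
    return " | ".join(row[-1] for row in rows)


instance_first = ("instance_id", "service_object_id", "start_time", "start_time_usec")
time_first = ("start_time", "instance_id", "service_object_id", "start_time_usec")

for columns in (instance_first, time_first):
    print(columns[0], "->", query_plan(columns, "start_time >= 1220000000"))
```

A query that filters only on start_time can use the index that leads with start_time, but not the one that leads with instance_id, which is the ordering point the slide makes.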

Asynchronous imports

Problem:

broker modules are run synchronously

ndo2db also runs synchronously

Everything waits!

[Diagram 1 labels: Nagios, ndomod, ndo2db, Disk]

[Diagram 2 labels: Nagios, ndomod, Directory, import_ndologsd, file2sock, ndo2db, DB]
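One way to picture the decoupling is a spool directory: the producer drops completed files and returns immediately, and a separate importer drains them at its own pace. This is a sketch of that general pattern, not the import_ndologsd implementation; `spool_write` and `spool_drain` are hypothetical names.

```python
import os
from typing import Callable, List


def spool_write(spool_dir: str, name: str, payload: str) -> str:
    """Producer side: write to a temp name, then rename so readers never see partial files."""
    tmp_path = os.path.join(spool_dir, f".{name}.tmp")
    final_path = os.path.join(spool_dir, name)
    with open(tmp_path, "w", encoding="utf-8") as handle:
        handle.write(payload)
    os.replace(tmp_path, final_path)  # atomic within one filesystem
    return final_path


def spool_drain(spool_dir: str, deliver: Callable[[str], None]) -> List[str]:
    """Consumer side: deliver completed files in name order, deleting each on success."""
    done = []
    for name in sorted(os.listdir(spool_dir)):
        if name.startswith("."):
            continue  # still being written
        path = os.path.join(spool_dir, name)
        with open(path, encoding="utf-8") as handle:
            deliver(handle.read())
        os.remove(path)
        done.append(name)
    return done
```

If the importer or the database is slow, files simply accumulate in the directory and the producer is never blocked.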

Asynchronous imports, 2

File IPC

Larger blocksize for file2sock

Host failures also rotate

Performance improvements

Strip unnecessary data being sent to NDO

Broker level

ndomod level

Helper tables - invoked at configdumpend

Multi-valued inserts

Housekeeping external
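The multi-valued insert idea can be sketched as below: many rows go to the database in one statement instead of one round trip per row. SQLite is used as a stand-in for MySQL, the table `demo_status` and the helper `multi_row_insert` are hypothetical, and the `?` placeholder style is SQLite's (MySQL drivers typically use a different marker).

```python
import sqlite3
from typing import List, Sequence, Tuple


def multi_row_insert(table: str, columns: Sequence[str],
                     rows: Sequence[Sequence[object]]) -> Tuple[str, List[object]]:
    """Build one INSERT ... VALUES (...), (...) statement plus its flat parameter list."""
    if not rows:
        raise ValueError("at least one row is required")
    group = "(" + ", ".join("?" for _ in columns) + ")"
    sql = (
        f"INSERT INTO {table} ({', '.join(columns)}) VALUES "
        + ", ".join(group for _ in rows)
    )
    params = [value for row in rows for value in row]
    return sql, params


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_status (object_id INTEGER, state INTEGER)")
sql, params = multi_row_insert(
    "demo_status", ["object_id", "state"], [(1, 0), (2, 2), (3, 1)]
)
conn.execute(sql, params)
```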

Summary