Lessons in Nagios learnt from developing Opsview
Ton Voon
Altinity Limited
September 2008
Copyright Ton Voon. Released under Creative Commons, Attribution-Noncommercial
Classified: Top secrets of Nagios in Opsview
“Opsview”?
Our monitoring solution
Database back end
Web front end
Configuration and status
Distributed
Open source
Why?
Obligation – GPLv2
http://trac.opsview.org/browser/trunk/opsview-base
Moral duty
Business benefit:
Support moves to core projects
Easier for us to upgrade base code
Advertise ourselves as experts
Distributed environments
Batch uploading to master
Nagios documentation suggests calling ocsp command after every result
send_nsca
Batch up requests and send all at once
Implemented as service_perfdata_file_processing_command
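A minimal sketch of the batching idea, assuming results are queued as (host, service, return code, output) tuples and then handed to one send_nsca process. The function names and the send_nsca.cfg path are illustrative assumptions, not taken from Opsview; only the tab-separated input format and the -H and -c options are standard send_nsca behaviour.

```python
import subprocess


def format_batch(results):
    """Render (host, service, return_code, output) tuples as send_nsca input.

    send_nsca reads one result per line, with tab-separated fields by default.
    """
    lines = []
    for host, service, code, output in results:
        # Tabs and newlines inside plugin output would break the line framing.
        clean = output.replace("\t", " ").replace("\n", " ")
        lines.append(f"{host}\t{service}\t{code}\t{clean}\n")
    return "".join(lines)


def send_batch(results, master, config="/usr/local/nagios/etc/send_nsca.cfg"):
    """Send every queued result in a single send_nsca invocation.

    The config path is an assumed default; adjust it for the local install.
    """
    payload = format_batch(results)
    subprocess.run(
        ["send_nsca", "-H", master, "-c", config],
        input=payload.encode(),
        check=True,
    )
```

One process and one connection per batch replaces one per check result, which is where the saving comes from.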
nsca --single
Aggregated writes
Problem: if the Nagios command file is not available, NSCA writes to a dump file. But when the command file comes back, NSCA doesn't switch back to it
Affects Nagios reloads
Solution: when the dump file is in use, keep checking for the command file
In NSCA CVS HEAD, but not released yet
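The fallback behaviour described above can be sketched as follows. This is an illustration of the idea only, not the NSCA patch itself: the class name is hypothetical, and a regular file stands in for the Nagios command file (which is really a named pipe).

```python
import os


class CommandWriter:
    """Write to the command file when present, otherwise to a dump file."""

    def __init__(self, cmd_path, dump_path):
        self.cmd_path = cmd_path
        self.dump_path = dump_path

    def target(self):
        # Re-evaluated on every write, so the writer returns to the
        # command file as soon as it reappears (e.g. after a reload).
        if os.path.exists(self.cmd_path):
            return self.cmd_path
        return self.dump_path

    def write(self, line):
        """Append one line to the current target and report which path was used."""
        path = self.target()
        with open(path, "a") as handle:
            handle.write(line + "\n")
        return path
```

The point is that the choice of target is not sticky: once the dump file has been used, the command file is still checked before each write.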
Distributing CGI commands
Problem: Submitting a command via CGI on master should go down to slaves
Solution: broker module altinity_distributed_commands. For a selected list of external commands, writes to a cache file
Some commands cannot be sent to slaves
DEL_HOST_SVC_DOWNTIME;hostname
Distributable API commands
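A rough sketch of the filtering step, written in Python rather than as a broker module. The selection of commands below is a hypothetical example (the slides do not list the real one); the line format `[timestamp] COMMAND;args` is the standard Nagios external command format.

```python
import re

# Hypothetical selection; the list used by altinity_distributed_commands
# is not given in the slides.
DISTRIBUTED = {"ACKNOWLEDGE_SVC_PROBLEM", "SCHEDULE_HOST_DOWNTIME"}

# Standard external command line: "[timestamp] COMMAND_NAME;arg1;arg2..."
LINE = re.compile(r"^\[(\d+)\] ([A-Z_]+)(?:;(.*))?$")


def command_name(line):
    """Return the command name, or None if the line is not a valid command."""
    match = LINE.match(line.strip())
    return match.group(2) if match else None


def cache_if_distributed(line, cache_path):
    """Append the command to the cache file if it is on the selected list."""
    if command_name(line) in DISTRIBUTED:
        with open(cache_path, "a") as cache:
            cache.write(line.strip() + "\n")
        return True
    return False
```

Commands that refer to identifiers local to one Nagios instance are the kind that would be left off such a list.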
Freshness calculations
Arbitrary 15 seconds
We set to 30 minutes
In Nagios 3: additional_freshness_latency
Included libtap tests
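To make the effect of the latency allowance concrete, here is a deliberately simplified sketch. It is not the Nagios calculation, which takes more inputs into account; the function and parameter names are assumptions for illustration.

```python
def is_stale(now, last_check, freshness_threshold, extra_latency=15):
    """Return True when a result is older than the threshold plus the allowance.

    All values are in seconds; extra_latency plays the role of the
    fixed 15 seconds mentioned above.
    """
    return now > last_check + freshness_threshold + extra_latency


# A result from a slave that arrives 20 seconds late is stale with a
# 15 second allowance, but still fresh with a 30 minute one.
late_by_20 = 1000 + 300 + 20
stale_with_default = is_stale(late_by_20, last_check=1000, freshness_threshold=300)
stale_with_30_min = is_stale(late_by_20, 1000, 300, extra_latency=30 * 60)
```

With batched uploads from slaves, results routinely arrive later than 15 seconds, which is why a larger allowance matters in a distributed setup.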
NRPE
Centrally managed agents
Set allowed_hosts to blank
Commands defined to pass arguments from check_nrpe
command[check_disk]=/usr/local/nagios/libexec/check_disk $ARG1$
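The `$ARG1$` placeholder in the command definition is filled from the arguments that check_nrpe sends. The substitution can be sketched as below; this is an illustration of the idea, not NRPE's own code, and the function name is hypothetical. Note that the NRPE daemon only accepts arguments when it is built with command argument support and `dont_blame_nrpe=1` is set.

```python
import re


def expand(template, args):
    """Replace $ARG1$, $ARG2$, ... in a command template with the given arguments."""
    def substitute(match):
        index = int(match.group(1)) - 1
        # Placeholders with no matching argument expand to an empty string.
        return args[index] if 0 <= index < len(args) else ""
    return re.sub(r"\$ARG(\d+)\$", substitute, template)


command = expand(
    "/usr/local/nagios/libexec/check_disk $ARG1$",
    ["-w 10% -c 5% -p /"],
)
```

Because thresholds travel with the request, they can be managed centrally on the Nagios server instead of in each agent's configuration.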
Increased output
Backwards compatible
Increases to whatever the nrpe server is compiled with
[Diagram: check_nrpe on the Nagios server, nrpe on the remote host]
Nagios
Slicing services in CGI
Problem: Only want to show services specific to a user, not all services on the host
Solution: Removed the CGI authorisation that grants access to a service through its host
Initial states
Problem: Services and hosts go into a PENDING state. But this affects reporting and state changes because there has never been a result received. Also, there’s no entry in nagios_hoststatus/nagios_servicestatus
Solution: Create a broker module to send an UP/OK for all hosts/services
Changing command based on timeperiod
Could do via Nagios API
But requires external process to submit
Can now do via configuration:
define service {
    ...
    check_timeperiod_command workhours,command_name
}
Notification logic performance tuning
Problem: 100% cpu!
Solution: strace on nagios showed lots of time spent in notification logic, calculating macros
NDOutils
Case insensitive object names
nagios_objects is the key table
But name1, name2 are case insensitive
HostnameA and hostnamea are the same host in NDO
But not in Nagios
ALTER TABLE nagios_objects MODIFY name1 varchar(128) COLLATE latin1_bin
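The effect of the collation change can be demonstrated with SQLite as a stand-in for MySQL (an assumption made so the example is self-contained: `NOCASE` plays the role of MySQL's default case-insensitive collation and `BINARY` the role of `latin1_bin`). The table here is a one-column toy, not the real nagios_objects schema.

```python
import sqlite3


def matching_rows(collation, lookup):
    """Count rows whose name1 equals `lookup` under the given column collation."""
    db = sqlite3.connect(":memory:")
    db.execute(f"CREATE TABLE objects (name1 TEXT COLLATE {collation})")
    db.executemany(
        "INSERT INTO objects VALUES (?)",
        [("HostnameA",), ("hostnamea",)],
    )
    (count,) = db.execute(
        "SELECT COUNT(*) FROM objects WHERE name1 = ?", (lookup,)
    ).fetchone()
    db.close()
    return count


case_insensitive = matching_rows("NOCASE", "hostnamea")  # both hosts match
case_sensitive = matching_rows("BINARY", "hostnamea")    # only the exact name matches
```

With a case-insensitive column, a lookup for one host silently matches the other, which is the mismatch with Nagios described above.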
Indexing
MySQL generally uses only one index per table per query
Multi column indexes have important ordering
(instance_id, service_object_id, start_time, start_time_usec)
(start_time, instance_id, service_object_id, start_time_usec)
Use EXPLAIN to work out how MySQL will tackle your query
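Why column order matters can be shown with a small experiment. The sketch below uses SQLite's `EXPLAIN QUERY PLAN` as a stand-in for MySQL's `EXPLAIN` (an assumption for the sake of a self-contained example; the planners differ in detail), and a simplified table that is not the real NDO schema.

```python
import sqlite3


def plan_for(index_columns):
    """Return the query plan text for a time-range query with one index defined."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE servicechecks ("
        "instance_id INTEGER, service_object_id INTEGER, "
        "start_time INTEGER, start_time_usec INTEGER, output TEXT)"
    )
    db.execute(f"CREATE INDEX idx ON servicechecks ({', '.join(index_columns)})")
    rows = db.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT * FROM servicechecks WHERE start_time BETWEEN ? AND ?",
        (0, 100),
    ).fetchall()
    db.close()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(row[-1] for row in rows)


instance_first = plan_for(
    ["instance_id", "service_object_id", "start_time", "start_time_usec"]
)
time_first = plan_for(
    ["start_time", "instance_id", "service_object_id", "start_time_usec"]
)
```

A query that filters only on start_time can search the index that leads with start_time, but has to scan the table when the index leads with instance_id.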
Asynchronous imports
Problem:
broker modules are run synchronously
ndo2db also runs synchronously
Everything waits!
[Diagram, synchronous: Nagios, ndomod, ndo2db, disk]
[Diagram, asynchronous: Nagios, ndomod, directory, import_ndologsd, file2sock, ndo2db, DB]
Asynchronous imports, 2
File IPC
Larger blocksize for file2sock
Host failures also rotate
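The file-based hand-off between the writer and the importer can be sketched as below. This is a generic spool-directory pattern under assumed names (`spool`, `drain`, the `.log` suffix), not the actual ndomod or import_ndologsd code.

```python
import itertools
import os
import tempfile

_sequence = itertools.count()


def spool(directory, payload):
    """Write payload to a hidden temp file, then rename it into place.

    The rename is atomic on the same filesystem, so the importer never
    sees a partially written file.
    """
    fd, tmp = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    with os.fdopen(fd, "w") as handle:
        handle.write(payload)
    final = os.path.join(directory, f"{next(_sequence):012d}.log")
    os.rename(tmp, final)
    return final


def drain(directory, handler):
    """Process completed files oldest first, deleting each once handled."""
    names = sorted(n for n in os.listdir(directory) if n.endswith(".log"))
    for name in names:
        path = os.path.join(directory, name)
        with open(path) as handle:
            handler(handle.read())
        os.remove(path)
    return len(names)
```

The writer returns as soon as the file is renamed, so a slow database only delays the importer, not Nagios.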
Performance improvements
Strip unnecessary data being sent to NDO
Broker level
ndomod level
Helper tables - invoked at configdumpend
Multi valued inserts
Housekeeping external
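For the multi-valued inserts item above, a minimal sketch of building one INSERT that carries several rows. SQLite stands in for MySQL here, and the `checks` table and its columns are hypothetical.

```python
import sqlite3


def insert_many(db, rows):
    """Insert all (object_id, state) rows with a single multi-row INSERT."""
    if not rows:
        return 0
    # One "(?, ?)" group per row, all bound in a single statement.
    placeholders = ", ".join(["(?, ?)"] * len(rows))
    flat = [value for row in rows for value in row]
    db.execute(
        f"INSERT INTO checks (object_id, state) VALUES {placeholders}",
        flat,
    )
    return len(rows)


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE checks (object_id INTEGER, state INTEGER)")
inserted = insert_many(db, [(1, 0), (2, 2), (3, 1)])
```

Sending one statement for many rows cuts the per-statement overhead of parsing and round trips, which adds up at the volume of data Nagios produces.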
Summary