http://www.grnet.gr
GRNET NOC
Use puppet and network inventory to populate nagios/icinga configuration
TF-NOC Dublin
Alexandros Kosiaris
Network & Equipment
•Storage Equipment:
–NetApp/IBM N5300
–EMC Celerra NS-480
•Computing Equipment:
–12 blade servers (HP BL-460c)
–12 IBM 1U servers
–128 Fujitsu 1U servers
–275 HP ProLiant 2U servers
•Virtualization (KVM): ~200 VMs
•Optical Network:
–~70 cities (+30 within the next year)
–15-year leased dark fiber
–DWDM/CWDM network
•Optical Equipment:
–Alcatel 1626LM, 1696MS, 1678MCC
–Adva FSP2000
•Routing Equipment:
–Juniper T1600, Juniper MX960
–~10x Cisco 12000s, a few Cisco 7200s/7300s
•Switching Equipment:
–Cisco 6500
–Several Cisco 3750, Cisco 2970, Juniper EX4200, Extreme X450a/X350
Nagios + Network Equipment or (more accurately) Switching and Routing
In-house developed Network Inventory (a.k.a. GRNETDB)
•A MySQL database of almost 150 tables
•Populated multiple times a day by a PHP discovery script
–SNMP, telnet + expect
•Basic concepts:
–Node
–Interface
–Layer
–Domain
–Location
•These concepts get extended to represent functionality:
–Routing, Switching nodes
–Layer2, Layer3 interfaces
–Switching, administrative domains
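The basic concepts and their extensions can be pictured as a small class hierarchy. This is only an illustrative sketch: the real GRNETDB is a MySQL schema of almost 150 tables, and every class, field and method name below is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Location:
    # Hypothetical fields; coordinates are optional in this sketch.
    name: str
    latitude: Optional[float] = None
    longitude: Optional[float] = None


@dataclass
class Interface:
    name: str
    description: str = ""
    layer: int = 2  # 2 for a Layer2 interface, 3 for a Layer3 interface


@dataclass
class Node:
    name: str
    location: Location
    domain: str  # e.g. a switching or an administrative domain
    interfaces: List[Interface] = field(default_factory=list)


@dataclass
class RoutingNode(Node):
    """A Node extended to represent routing functionality."""

    def l3_interfaces(self) -> List[Interface]:
        # Only Layer3 interfaces are relevant to routing.
        return [iface for iface in self.interfaces if iface.layer == 3]
```

The base concepts stay generic, and functionality such as routing is layered on top by extension.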
In-house developed Python Django project, with multiple sub-apps
•Network (the interface to the database)
•RG (router graphs, take a peek at http://mon.grnet.gr/rg)
•Maps (take a look at http://mon.grnet.gr/network/maps)
•Hostmaster
•Optical network (built mostly on Location info)
•Nadjicingo
–Builds on the network app and generates a nagios/icinga configuration
•Nagvis
–Same thing, but generates/updates the nagvis config
Nadjicingo
A Django management command that outputs nagios/icinga configuration
•Run by crontab every hour (manage.py nadjicingo)
•Generates nagios configuration objects for:
–Routers
–Switches
–Interfaces
•L3 topology aware: nagios hates cyclic dependencies (which redundant links create), so it populates the parents field for most devices
•Hardware checks on devices
•Business logic embedded in interface descriptions; part of each description is a unique identifier for a customer's link:
–[.NTUA-4] => National Technical University of Athens' L3 link
–[AUTH@ERMOU-1] => Aristotle University of Thessaloniki L2 link at Ermou PoP
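One way to get cycle-free parents out of a topology with redundant links is a breadth-first search from the monitoring server's first hop, keeping only the link over which each device was first reached. The slides do not say how nadjicingo derives parents, so the sketch below is an assumption; `compute_parents` and the device names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Optional, Tuple


def compute_parents(links: Iterable[Tuple[str, str]], root: str) -> Dict[str, Optional[str]]:
    """Breadth-first search from `root`.

    Each device's parent is the device it was first reached from, so the
    result is a spanning tree and contains no cycles even when the
    underlying topology has redundant links.
    """
    neighbours: Dict[str, List[str]] = {}
    for a, b in links:
        neighbours.setdefault(a, []).append(b)
        neighbours.setdefault(b, []).append(a)

    parents: Dict[str, Optional[str]] = {root: None}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        # Sort peers so the output is deterministic between runs.
        for peer in sorted(neighbours.get(node, [])):
            if peer not in parents:
                parents[peer] = node
                queue.append(peer)
    return parents
```

For a triangle of routers r1, r2, r3 rooted at r1, both r2 and r3 get r1 as parent and the redundant r2-r3 link is simply ignored for dependency purposes.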
Nagvis
A Django management command (again...)
•Run by crontab every hour (manage.py nagvis)
•Updates a specific nagvis map configuration by:
–Removing obsolete nodes
–Adding new nodes to a special area for manual positioning on the map
•Also features an automated positioning mode based on each device's latitude and longitude
–Nice for showing off, but not for an overview in monitoring applications
•Only populates host objects in the map
–Service objects cluttered it too much, and the information is readily available anyway
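The two update modes can be sketched as a set difference plus a coordinate mapping. This is a minimal illustration under assumptions: the function names, the bounding box and the simple linear projection are all hypothetical, and the real command may position nodes differently.

```python
from typing import Set, Tuple


def diff_map_nodes(on_map: Set[str], in_inventory: Set[str]) -> Tuple[Set[str], Set[str]]:
    """Return (obsolete, new): hosts to remove from the map and hosts to add to it."""
    return on_map - in_inventory, in_inventory - on_map


def position(lat: float, lon: float,
             bounds: Tuple[float, float, float, float],
             width: int, height: int) -> Tuple[int, int]:
    """Linearly map latitude/longitude to pixel coordinates.

    `bounds` is (lat_min, lat_max, lon_min, lon_max) for the area the map
    image covers. The y axis grows downwards, so higher latitudes get
    smaller y values.
    """
    lat_min, lat_max, lon_min, lon_max = bounds
    x = (lon - lon_min) / (lon_max - lon_min) * width
    y = (lat_max - lat) / (lat_max - lat_min) * height
    return round(x), round(y)
```

New hosts returned by `diff_map_nodes` would either go to the manual positioning area or, in the automated mode, be placed with `position`.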
Nagvis Network Map
Servers, Services? A little bit of history
•For years, GRNET had only very basic services (DNS, email, web)
•And some router-supporting services (looking glass, mrtg, rancid)
•And very few servers (<=10)
•3 years ago, a major paradigm shift from networking to services
•20 servers bought, then 132, and recently 275 more
•End-user services were born:
–Public cloud storage service (Pithos)
–Virtual Private Servers (ViMa)
–Students' book statements (Eudoxus)
–Student ID cards (Paso)
–Public IaaS (Okeanos)
–Academic professor elections (Apella)
•Plus many other services and projects (TCS, Whois, NTP, VoD, …)
•The result? => 200 VMs were created for managing all this infrastructure
Puppet to the rescue
What is Puppet?
•It's a stack of applications
•It's a language (a declarative one as well)
•It's a policy and state enforcing tool
•It's an attribute and state discovery tool (kind of...)
•It's a new paradigm in managing systems!
What is Puppet not?
•Not just an automation tool
•Not a “For loop”
•Not a command execution framework (it can be reduced to that though)
AGAIN: A new paradigm, you need to change the way you work
Puppet Concepts
Facts
•Attributes of a system:
–OS version and family
–Available memory
–CPUs
–Block devices
–IP addresses/netmasks
–MAC addresses
•And anything else you can write discovery code for:
–LLDP neighbours
–IPMI functionality
–Hardware info
–Apache vhosts
•Discovered by facter and then made available to Puppet
Puppet Concepts (2)
Resources
•Files, Directories
•Users, Groups
•Packages
•Vlans
•Interfaces
•Nagios objects!!!!
•And a lot more (http://docs.puppetlabs.com/references/latest/type.html)
Classes
•A way to group resources
•Support inheritance and mixins (aka including)
•The standard class has 3 resources defined:
–package { 'software': }
–file { '/etc/software.conf': }
–service { 'softwared': }
Puppet Concepts (3)
Nodes
•A.k.a. machines (VM or hardware)
•A node CAN (and probably will) have multiple puppet classes
•Node population can be done in multiple ways:
–Puppet language config
–LDAP
–External script
Puppetd agents run on each machine (daemon or crontab)
Central Puppetmaster (with an RDBMS) holds all the configuration and data
Hello World example
class helloworld {
  file { '/tmp/helloworld':
    ensure  => present,
    owner   => 'root',
    group   => 'root',
    # Quoted, with a leading zero, so it is read as an octal file mode
    mode    => '0640',
    content => 'Hello world',
  }
}
node mynode {
  include helloworld
}
Will create /tmp/helloworld with all the attributes defined above.
More importantly, if run again it will wipe any changes made in the meantime and restore the state defined above.
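The "run again and the state is restored" behaviour is what separates state enforcement from a one-off script. The sketch below imitates it for a single file in Python. It is not how Puppet is implemented; `ensure_file` is a hypothetical helper, and owner/group handling is left out because it needs root.

```python
import os
import stat


def ensure_file(path: str, content: str, mode: int) -> bool:
    """Bring `path` to the desired content and mode.

    Returns True if something had to be changed, False if the file was
    already in the desired state.
    """
    changed = False
    try:
        with open(path, "r", encoding="utf-8") as handle:
            current = handle.read()
    except FileNotFoundError:
        current = None

    # Only rewrite the file when its content has drifted or it is missing.
    if current != content:
        with open(path, "w", encoding="utf-8") as handle:
            handle.write(content)
        changed = True

    # Only chmod when the permission bits differ from the desired ones.
    if stat.S_IMODE(os.stat(path).st_mode) != mode:
        os.chmod(path, mode)
        changed = True
    return changed
```

Calling `ensure_file(path, 'Hello world', 0o640)` twice in a row changes something only the first time; a manual edit in between is reverted on the next call.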
Back to nagios
Let's use a puppet native type
nagios_host { $hostname:
  address        => '10.10.10.10',
  alias          => 'myhost',
  contact_groups => 'hostadmins',
  hostgroups     => 'Puppeted Servers',
}
/etc/nagios/nagios_host.cfg gets populated
Problem is ...
•This is executed on the machine running puppetd, not on the nagios server.
No problem. Puppet supports exported resources.
Exported resources
Let's prepend the definition with two @ signs
@@nagios_service { 'myservice':
  contact_groups => 'hostadmins',
  host_name      => $hostname,
  tag            => 'collect_me_nagios_server',
}
•Exports the resource but does not realize it on the machine running puppetd
•No /etc/nagios/nagios_service.cfg file is created there
Nagios_service <<| tag == 'collect_me_nagios_server' |>>
•Goes in the nagios server's manifest
•/etc/nagios/nagios_service.cfg gets populated
•nagios.cfg/icinga.cfg can now just include the file/directory and monitoring begins
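The export/collect mechanism can be modelled as a shared store that agents write to and the nagios server filters by type and tag. This toy model is not Puppet's implementation, and `ResourceStore` and its methods are hypothetical names. Its main point is that the tag in the collector must match the exported tag exactly, or nothing is collected.

```python
from typing import Dict, List

Resource = Dict[str, str]


class ResourceStore:
    """Stands in for the puppetmaster database that holds exported resources."""

    def __init__(self) -> None:
        self._exported: List[Resource] = []

    def export(self, rtype: str, title: str, tag: str, **params: str) -> None:
        # Called from an agent's catalog: record the resource, realize nothing locally.
        self._exported.append({"type": rtype, "title": title, "tag": tag, **params})

    def collect(self, rtype: str, tag: str) -> List[Resource]:
        # Called from the nagios server's catalog: return every matching export.
        return [r for r in self._exported if r["type"] == rtype and r["tag"] == tag]


store = ResourceStore()
store.export("nagios_service", "myservice", "collect_me_nagios_server", host_name="web1")

collected = store.collect("nagios_service", "collect_me_nagios_server")
missed = store.collect("nagios_service", "collect_me_nagiosserver")  # misspelled tag
```

Here `collected` holds the exported service while `missed` is empty, which is the silent failure a misspelled tag produces.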
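Simple example
A manifest for all authoritative DNS servers
•Install bind9, install configuration and ensure it is running
•Open up firewall
•Set up a simple DNS check
class authoritativedns {
  include bind9
  include service::dns

  # Exported resource titles must be unique across all nodes,
  # so the title includes the FQDN of the exporting server.
  @@nagios_service { "authdns_${::fqdn}":
    host_name           => $::fqdn,
    service_description => 'authdns',
    check_command       => 'check_dig!www.grnet.gr',
    servicegroups       => 'DNS,DNS:Authoritative',
  }
}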
Interesting use cases
Class hierarchy means:
•A base class nagios::host is included in all other classes
•So all servers are nagios-monitored without any intervention
But: a server is physical and has IPMI capabilities, so export another nagios host for it
if $ipmi_capable {
  @@nagios_host { $ipmi_dns:
    address => $ipmi_ipaddress,
    tag     => 'hardwarehost',
  }
}
Interesting use cases (2)
Server is an HP ProLiant server
class hp-health {
  package { [ 'hp-health', 'hpacucli' ]:
    ensure => present,
  }
  nagios::host::service { 'hpacucli':
    ensure        => present,
    servicegroups => 'HARDWARE',
    command       => 'check_nrpe!dsa-check-hpacucli!0',
  }
  nagios::host::service { 'hpasm':
    ensure        => present,
    servicegroups => 'HARDWARE',
    command       => 'check_nrpe!dsa-check-hpasm!0',
  }
}
Interesting use cases (3)
Multicast beacons (double exported resources!!!)
define ssmping_check($ipv4, $ipv6) {
  $local  = $::fqdn
  $remote = $name
  if ($::ipaddress and $ipv4 and $local != $remote) {
    # Braces around variable names keep the hyphens out of the interpolation
    @@nagios_service { "ping-ssm-${remote}-${local}-v4":
      ensure              => present,
      check_command       => "check_nrpe!check_ssmping!${ipv4}",
      host_name           => $local,
      service_description => "Multicast from ${remote} SSM IPv4",
    }
    …
}
# export the checks...
@@ssmping_check { $fqdn:
  ipv4 => $ipaddress,
  ipv6 => $ipv6address,
}
Interesting use cases (4)
Standard checks for all servers
nagios::host::service { 'disk':
  command => 'check_nrpe!check_disk!13% 7%',
}
nagios::host::service { 'load':
  command => 'check_nrpe!check_load!4,3,2 5,4,3',
}
nagios::host::service { 'users':
  command => 'check_nrpe!check_users!20 30',
}
nagios::host::service { 'swap':
  command => 'check_nrpe!check_swap!60 40',
}
nagios::host::service { 'check_tainted':
  command => 'check_nrpe!check_tainted!0',
}
nagios::host::service { 'check_firewall':
  command => 'check_nrpe!check_firewall!0',
}
Problems arise
/etc/nagios/*.cfg files can quickly become large
•However, each resource collection reads the entire file
•Problem solved by disabling collections and creating the entire config file every time; a more elegant solution would be nice
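Creating the entire file in one pass could look like the sketch below. The slides do not show how GRNET does it, so this is an assumption: `render_objects` and `write_atomically` are hypothetical helpers that render every object once and then swap the finished file into place, so nagios never reads a half-written configuration.

```python
import os
import tempfile
from typing import Dict, Iterable


def render_objects(kind: str, objects: Iterable[Dict[str, str]]) -> str:
    """Render nagios object definitions of one kind (e.g. 'service') as text."""
    blocks = []
    for obj in objects:
        body = "\n".join(f"    {key}  {value}" for key, value in sorted(obj.items()))
        blocks.append(f"define {kind} {{\n{body}\n}}\n")
    return "\n".join(blocks)


def write_atomically(path: str, text: str) -> None:
    """Write to a temporary file in the same directory, then rename it over `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(text)
    # mkstemp creates the file with mode 0600; make it readable for the nagios user.
    os.chmod(tmp_path, 0o644)
    os.replace(tmp_path, path)
```

The cost is one full rewrite per run instead of one file read per collected resource.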
Exported resources cost
•Each is an entry in the database, and they are not used for nagios alone
•Execution speed suffers and runs sometimes time out
•Problem solved in the database by adding some indexes... but it is bound to show up again
•Puppet devs know about it; some effort is going into it
Problems arise (2)
Puppet's declarative language can cause problems at times
@@nagios_host { 'myhost':
  hostgroups => $myhostgroups,
}
•The host also has classes A, B, C apart from the nagios class
•Which class is going to declare $myhostgroups?
•Multiple solutions exist, none of them elegant:
–Externally (via LDAP)
–Fact based
–Populate hostgroups, not hosts
Problems arise (3)
Active checks cost. Not a Puppet issue but a nagios one
•check_mk
•Distributed monitoring
–Well, obsess_over_services sucks…
–mod_gearman
•For now, splitting the infrastructure into:
–Networking
–Services
•But if Services grow more? Variable tagging on resources
@@nagios_service { 'myservice':
  contact_groups => 'hostadmins',
  host_name      => $hostname,
  tag            => 'collect_me_nagios_server_N',
}
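The slides leave open how N would be chosen for each host. One possible rule, sketched here as an assumption, is to hash the hostname so that every host lands on the same nagios server on every run. `nagios_tag` is a hypothetical helper; in practice the same rule would have to be expressed in the manifest or in a fact.

```python
import zlib


def nagios_tag(hostname: str, servers: int) -> str:
    """Map a hostname to one of `servers` collection tags, stably across runs."""
    if servers < 1:
        raise ValueError("servers must be at least 1")
    # crc32 is deterministic, unlike Python's built-in hash() for strings.
    index = zlib.crc32(hostname.encode("utf-8")) % servers + 1
    return f"collect_me_nagios_server_{index}"
```

Each nagios server would then collect only the resources carrying its own tag. Note that changing the number of servers reshuffles most hosts under this simple rule.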