OSMC 2014: Introduction into collectd | Florian Forster
collectd: An introduction
About me
● Florian "octo" Forster
● Open-source work since 2001
● Started collectd in 2005
Agenda
● collectd
● Aggregation of metrics
● Alerting with Icinga
collectd
● Daemon
● collect metrics
● mangle / transport metrics
● store metrics (no retrieve)
collectd
● Open-source project
○ MIT and GPL licensed
● Platform independent
○ Linux, BSD, Solaris, AIX, HP-UX, …
○ Windows via SSC Serv (non-free)
collectd
● Agent-based design
○ Runs on each host
● Extensible via plugins
○ Language bindings (Perl, Python, Java)
○ "exec" plugin, e.g. shell scripts
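As a rough illustration of the "exec" route, the sketch below builds the PUTVAL lines that a script run by the exec plugin prints to standard output. The plugin name `example`, the type instance `answer`, the value 42, and the helpers `format_putval` and `emit_once` are hypothetical; `COLLECTD_HOSTNAME` and `COLLECTD_INTERVAL` are the environment variables the exec plugin is expected to set for the script.

```python
import os


def format_putval(host, plugin, type_, type_instance, interval, value):
    """Build one PUTVAL line; "N" asks collectd to use the current time."""
    identifier = f"{host}/{plugin}/{type_}-{type_instance}"
    return f'PUTVAL "{identifier}" interval={interval} N:{value}'


def emit_once(value):
    """Print a single value for the hypothetical "example" plugin."""
    # Fall back to placeholder defaults when not started by the exec plugin.
    host = os.environ.get("COLLECTD_HOSTNAME", "localhost")
    interval = os.environ.get("COLLECTD_INTERVAL", "10")
    # flush=True because collectd reads the script's output through a pipe.
    print(format_putval(host, "example", "gauge", "answer", interval, value), flush=True)
```

A real script would typically call `emit_once` in a loop, sleeping for the interval between iterations.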
collectd
● 95+ "read" (input) plugins
○ System metrics (e.g. CPU, memory)
○ Application metrics (e.g. MySQL)
○ Other (Xeon Phi, SNMP, OneWire)
collectd
● 15+ "write" (output) plugins
○ Graphite
○ RRDtool
○ RRDCacheD
○ Riemann
○ MongoDB
○ HTTP (generic)
collectd
# Input
LoadPlugin cpu
LoadPlugin memory
LoadPlugin df
<Plugin df>
  MountPoint "/"
  ValuesPercentage true
</Plugin>

# Output
LoadPlugin write_graphite
<Plugin write_graphite>
  <Node "default">
    Host "graphite.example.com"
  </Node>
</Plugin>
Example configuration
collectd
● collectd's write_graphite plugin
○ Sends metrics to Graphite
○ TCP or UDP transport
○ Metric names somewhat adjustable
→ Monitoring mit Graphite (15:30 in this room, German)
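To make the naming concrete, here is a minimal sketch of how a collectd identifier might end up as one line of Graphite's plaintext protocol ("path value timestamp"). The helper `graphite_line` is hypothetical, and its `prefix` and `escape_char` parameters are only loosely modeled on the plugin's naming options; the real plugin has more options and handles multi-value types, so treat the mapping as an assumption.

```python
import time


def graphite_line(identifier, value, timestamp=None, prefix="", escape_char="_"):
    """Turn "host/plugin/type" into one Graphite plaintext-protocol line."""
    host, plugin, type_ = identifier.split("/")
    # Dots separate path components in Graphite, so dots in the host are escaped.
    host = host.replace(".", escape_char)
    path = f"{prefix}{host}.{plugin}.{type_}"
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {timestamp}\n"
```

For example, `graphite_line("example.com/cpu-0/cpu-idle", 93.5, 1416300000, prefix="collectd.")` yields a line whose path is `collectd.example_com.cpu-0.cpu-idle`.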
Aggregation
● Aggregates often more useful for alerting
○ e.g. sum over CPUs, minimum RTT
● Metric storage often I/O bound
● Dashboards require "sane" amount of information
Aggregation
[Diagram: CPU, Disk, Memory, … → collectd (Aggregation) → Graphite]
Aggregation
● Load the Aggregation plugin
● Select (filter) applicable metrics
● Group by metric type and other fields
● Aggregate functions (e.g. sum)
Aggregation
LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
  </Aggregation>
</Plugin>
example.com/battery/percent-charged
example.com/cpu-0/cpu-idle
example.com/cpu-0/cpu-user
example.com/cpu-0/cpu-wait
example.com/cpu-1/cpu-idle
…
example.com/df-root/df_complex-free
example.com/df-root/df_complex-used
example.com/df-root/df_complex-rsvd
…
Load the aggregation plugin
Aggregation: Selection
● Five fields usable for selection
○ Host
○ Plugin
○ PluginInstance
○ Type (mandatory)
○ TypeInstance
Aggregation: Selection
LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
  </Aggregation>
</Plugin>
example.com/cpu-0/cpu-idle
example.com/cpu-0/cpu-user
example.com/cpu-0/cpu-wait
example.com/cpu-1/cpu-idle
example.com/cpu-1/cpu-user
example.com/cpu-1/cpu-wait
example.com/cpu-2/cpu-idle
example.com/cpu-2/cpu-user
example.com/cpu-2/cpu-wait
…
Select metrics
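The selection step can be sketched in a few lines: split an identifier of the form host/plugin-instance/type-instance into the five fields, then keep the metrics whose fields equal the configured ones. `Identifier`, `parse_identifier` and `selects` are hypothetical helpers for illustration only; this sketch does exact string matching and says nothing about how the plugin implements selection internally.

```python
from typing import NamedTuple


class Identifier(NamedTuple):
    host: str
    plugin: str
    plugin_instance: str
    type: str
    type_instance: str


def parse_identifier(text):
    """Split "host/plugin[-instance]/type[-instance]" into its five fields."""
    host, plugin_part, type_part = text.split("/")
    # Instances are optional; partition() leaves them empty when absent.
    plugin, _, plugin_instance = plugin_part.partition("-")
    type_, _, type_instance = type_part.partition("-")
    return Identifier(host, plugin, plugin_instance, type_, type_instance)


def selects(ident, **wanted):
    """True if every given field matches exactly; omitted fields match anything."""
    return all(getattr(ident, field) == value for field, value in wanted.items())
```

With `plugin="cpu", type="cpu"`, `example.com/cpu-0/cpu-idle` is selected while `example.com/df-root/df_complex-free` is not.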
Aggregation: Grouping
● Four fields usable for grouping
○ Host
○ Plugin
○ PluginInstance
○ TypeInstance
● At least one field must be left unspecified
Aggregation: Grouping
LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
    GroupBy Host
    GroupBy TypeInstance
  </Aggregation>
</Plugin>
example.com/cpu-???/cpu-idle
example.com/cpu-???/cpu-user
example.com/cpu-???/cpu-wait
Configure grouping
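Grouping can be pictured as bucketing the selected metrics by the GroupBy fields, so that everything in one bucket differs only in the ungrouped fields (here the CPU number). The following is a minimal sketch under that reading; `split_identifier` and `group_metrics` are hypothetical helpers, and including the type in every key is an assumption based on the type being mandatory.

```python
from collections import defaultdict


def split_identifier(text):
    """Split "host/plugin-instance/type-instance" into a dict of five fields."""
    host, plugin_part, type_part = text.split("/")
    plugin, _, plugin_instance = plugin_part.partition("-")
    type_, _, type_instance = type_part.partition("-")
    return {
        "Host": host,
        "Plugin": plugin,
        "PluginInstance": plugin_instance,
        "Type": type_,
        "TypeInstance": type_instance,
    }


def group_metrics(identifiers, group_by):
    """Bucket identifiers by the GroupBy fields; the type is always in the key."""
    groups = defaultdict(list)
    for text in identifiers:
        fields = split_identifier(text)
        key = tuple(fields[name] for name in ("Type", *group_by))
        groups[key].append(text)
    return dict(groups)
```

Calling `group_metrics(ids, ["Host", "TypeInstance"])` on the CPU metrics puts all `cpu-idle` values of one host into a single bucket, regardless of CPU number.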
Aggregation: Functions
● Up to six aggregate functions
○ Count
○ Sum
○ Minimum
○ Maximum
○ Average
○ Standard deviation
Aggregation
LoadPlugin aggregation
<Plugin aggregation>
  <Aggregation>
    Plugin "cpu"
    Type "cpu"
    GroupBy Host
    GroupBy TypeInstance
    CalculateSum true
  </Aggregation>
</Plugin>
example.com/cpu-sum/cpu-idle
example.com/cpu-sum/cpu-user
example.com/cpu-sum/cpu-wait
Select aggregate function(s)
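For one group of values, the six aggregate functions amount to the arithmetic below. This is a sketch, not the plugin's code: the helper `aggregate` and its result keys are hypothetical, and it assumes the population standard deviation, whereas the plugin may use a different estimator.

```python
import math


def aggregate(values):
    """Compute the six aggregates for one non-empty group of values."""
    count = len(values)
    total = math.fsum(values)
    average = total / count
    # Assumption: population standard deviation (divide by count).
    variance = math.fsum((v - average) ** 2 for v in values) / count
    return {
        "count": count,
        "sum": total,
        "minimum": min(values),
        "maximum": max(values),
        "average": average,
        "stddev": math.sqrt(variance),
    }
```

Applied to the `cpu-idle` values of all CPUs on a host, the `sum` entry corresponds to what the configuration above publishes as `cpu-sum/cpu-idle`.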
Aggregation
● Creates additional metrics
● Use chains to filter out unwanted "raw" metrics.
● Usable on client and/or server.
Alerting
● Load the Unixsock plugin
● Query and check values with collectd-nagios
● Both come with collectd
Alerting
LoadPlugin unixsock
<Plugin unixsock>
  SocketFile "/var/run/collectd-unixsock"
  SocketGroup "collectd-nagios"
  SocketPerms "0660"
  DeleteSocket true
</Plugin>
Load the Unixsock plugin
Alerting
-> GETVAL example.com/cpu-average/cpu-wait
<- 1 Value found
<- value=8.540017e+00
Query values with the Unixsock plugin
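The same exchange can be scripted. The sketch below sends a GETVAL command over the UNIX socket and parses the reply, assuming the reply has the shape shown above: a status line whose leading number is the count of "name=value" lines that follow, with a negative number signalling an error. `getval` and `parse_getval_response` are hypothetical helper names.

```python
import socket


def parse_getval_response(lines):
    """Parse "<status> <message>" followed by <status> lines of "name=value"."""
    status_text, _, message = lines[0].partition(" ")
    status = int(status_text)
    if status < 0:
        # Assumption: a negative status carries an error message.
        raise RuntimeError(message)
    values = {}
    for line in lines[1:1 + status]:
        name, _, raw = line.partition("=")
        values[name] = float(raw)
    return values


def getval(socket_path, identifier):
    """Ask a running collectd for the current value(s) of one identifier."""
    with socket.socket(socket.AF_UNIX, socket.SOCK_STREAM) as sock:
        sock.connect(socket_path)
        with sock.makefile("rw", encoding="utf-8", newline="\n") as stream:
            stream.write(f'GETVAL "{identifier}"\n')
            stream.flush()
            header = stream.readline().rstrip("\n")
            count = max(int(header.split(" ", 1)[0]), 0)
            body = [stream.readline().rstrip("\n") for _ in range(count)]
    return parse_getval_response([header] + body)
```

With the configuration above, `getval("/var/run/collectd-unixsock", "example.com/cpu-average/cpu-wait")` would return a dictionary with a single `value` entry, provided the caller may access the socket.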
Alerting
● collectd-nagios queries and checks metrics
● Ranged -w (warn) and -c (critical) options
● Conforms to Icinga's best practices
Alerting
$ collectd-nagios -s /var/run/collectd-unixsock \
> -n cpu-average/cpu-wait -H example.com \
> -w '0:10' -c '0:25'
OKAY: 0 critical, 0 warning, 1 okay | value=8.540017;;;;
Example: collectd-nagios
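The range options follow the usual monitoring-plugin convention: a value inside the range is fine, a value outside raises the alert, and a leading "@" inverts that. A minimal sketch of this logic is shown below; `parse_range` and `check` are hypothetical helpers that follow the general plugin guidelines rather than collectd-nagios's own parser, and treating an empty lower bound as 0 is an assumption.

```python
import math


def parse_range(text):
    """Parse "[@]min:max"; a bare N means 0:N, "~" means negative infinity."""
    invert = text.startswith("@")
    body = text[1:] if invert else text
    low_text, sep, high_text = body.partition(":")
    if not sep:
        # A bare number N is shorthand for the range 0:N.
        low_text, high_text = "0", low_text
    if low_text == "~":
        low = -math.inf
    else:
        # Assumption: an empty lower bound is treated as 0.
        low = float(low_text) if low_text else 0.0
    high = float(high_text) if high_text else math.inf
    return low, high, invert


def check(value, warn, crit):
    """Return (exit code, label) for a value, checking critical first."""
    def alerts(range_text):
        low, high, invert = parse_range(range_text)
        inside = low <= value <= high
        return inside if invert else not inside

    if alerts(crit):
        return 2, "CRITICAL"
    if alerts(warn):
        return 1, "WARNING"
    return 0, "OKAY"
```

Under these rules `check(8.540017, "0:10", "0:25")` gives `(0, "OKAY")`, matching the output above, while an open-ended range such as `"10:"` alerts on values below 10.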
Alerting
define command{
  command_name check_cpuio_collectd
  command_line collectd-nagios \
    -s /var/run/collectd-unixsock \
    -H $HOSTNAME$ \
    -n cpu-average/cpu-wait \
    -w $ARG1$ -c $ARG2$
}
define service{
  use generic-service
  host_name example.com
  service_description I/O wait
  check_command check_cpuio_collectd!0:10!0:25
}
commands.cfg services.cfg
Alerting
● What's next?
○ Use "passive checks"
○ Let collectd push metrics to Icinga 2?
○ Bring on the patches!
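One way to picture the "passive checks" idea: instead of Icinga running collectd-nagios, something on the collectd side submits results in the external command format used for passive service checks. The sketch below only formats such a line; `passive_check_line` is a hypothetical helper, and how the line reaches Icinga (for example through its command file) is left open.

```python
def passive_check_line(host, service, return_code, output, timestamp):
    """Format one PROCESS_SERVICE_CHECK_RESULT external command line."""
    return (
        f"[{int(timestamp)}] PROCESS_SERVICE_CHECK_RESULT;"
        f"{host};{service};{return_code};{output}\n"
    )
```

The return code uses the usual convention of 0 for OK, 1 for warning and 2 for critical.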
Thank you!
Questions?