Linux Clusters Institute: Monitoring
Zhongtao Zhang, System Administrator, Holland Computing Center, University of Nebraska-Lincoln
Why monitor?
May 24, 2017
Service Level Agreement (SLA)
• Which services must be provided by you?
• Which services must be provided to you?
• Regulatory requirements
• Contractual requirements
• Business requirements
• Common deliverables
  • Availability of services (uptime)
  • Mean time between failures (MTBF)
  • Mean time to repair or mean time to recovery (MTTR)
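The three common deliverables can be computed from a list of outages. The following is a minimal sketch, assuming outages are recorded as start and end times in hours within a reporting period; the `Outage` type, the `sla_summary` function and the sample numbers are illustrative and not part of the original slides.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One service outage, in hours since the start of the reporting period."""
    start_hour: float
    end_hour: float


def sla_summary(period_hours, outages):
    """Return availability, MTBF and MTTR for one reporting period."""
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    uptime = period_hours - downtime
    failures = len(outages)
    return {
        # Fraction of the period during which the service was up.
        "availability": uptime / period_hours,
        # Average operating time between failures.
        "mtbf_hours": uptime / failures if failures else float("inf"),
        # Average time needed to bring the service back.
        "mttr_hours": downtime / failures if failures else 0.0,
    }


# Made-up example: a 30-day period (720 h) with a 2 h and a 4 h outage.
summary = sla_summary(720.0, [Outage(100.0, 102.0), Outage(500.0, 504.0)])
print(summary)
```

Which events count as an outage (planned maintenance, a single node, the whole service) is a decision the SLA itself has to make.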
Monitoring and Notification Basic Flow
• Automated gathering of metrics: metric collection, metric aggregation
• Evaluate metrics against requirements: metric transformation, metric analysis
• Prepare and share metrics for stakeholders: presentation, notification
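The flow above can be sketched as a chain of small functions, one per stage. This is only an illustration: the node names, the load values and the 1.5 load-per-core limit are made up, and in practice each stage is usually handled by a separate tool.

```python
def collect():
    """Collection: stand-in for real collectors; the values are made up."""
    return [
        {"node": "node01", "load1": 0.5, "cores": 16},
        {"node": "node02", "load1": 30.0, "cores": 16},
    ]


def aggregate(samples):
    """Aggregation: bring all samples together in one place, keyed by node."""
    return {s["node"]: s for s in samples}


def transform(by_node):
    """Transformation: normalize load by core count so nodes are comparable."""
    return {node: s["load1"] / s["cores"] for node, s in by_node.items()}


def analyze(load_per_core, limit=1.5):
    """Analysis: evaluate the metric against a requirement (hypothetical limit)."""
    return sorted(node for node, value in load_per_core.items() if value > limit)


def present(load_per_core):
    """Presentation: a plain-text report for stakeholders."""
    return "\n".join(
        f"{node}: {value:.2f} load/core" for node, value in sorted(load_per_core.items())
    )


def notify(overloaded):
    """Notification: one message per node that failed the check."""
    return [f"ALERT: {node} is overloaded" for node in overloaded]


metrics = transform(aggregate(collect()))
print(present(metrics))
print(notify(analyze(metrics)))
```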
What to Collect (Metrics)
• Overall cluster health
  • Queue size
    • Jobs running
    • Jobs queued
  • Overall network usage
  • Number of responding nodes
• Individual node health
  • Load average
  • Memory used
  • Network bandwidth
  • CPU usage
  • Temperature
• Storage
  • Capacity
  • Degraded status
  • Connectivity
• Security
  • Logs of everything
• Power status
• Temperatures
  • Cold aisle
  • Switch exhausts
  • CPU temperatures
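As an illustration of two of the node health metrics listed above (load average and memory used), here is a minimal sketch that reads them on a Linux node using only the Python standard library. The function names are hypothetical, and the existing collection tools already gather these values, so this is only meant to show where the numbers come from.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and rest.split():
            values[key.strip()] = int(rest.split()[0])
    return values


def memory_used_fraction(meminfo):
    """Fraction of memory in use, based on MemTotal and MemAvailable."""
    return 1.0 - meminfo["MemAvailable"] / meminfo["MemTotal"]


def node_health():
    """Read live values; assumes a Linux node with /proc mounted."""
    with open("/proc/meminfo") as handle:
        meminfo = parse_meminfo(handle.read())
    load1, load5, load15 = os.getloadavg()
    return {"load1": load1, "memory_used": memory_used_fraction(meminfo)}


# Made-up sample in the /proc/meminfo layout, so the parser can be tried anywhere.
SAMPLE = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
HugePages_Total:       0
"""
print(memory_used_fraction(parse_meminfo(SAMPLE)))
```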
Metric Collection
• Collection tools (common tools)
  • Ganglia
  • Collectd
  • Perfmon
  • Performance Co-Pilot (PCP)
  • Nagios
  • Unified Fabric Manager (UFM)
  • Cacti
  • Syslog
  • TACC Stats
  • Scripts
Collection tools already exist to capture most metrics.
No single tool will do everything you need, unless you write it yourself.
Try to avoid re-inventing the wheel.
Metric Aggregation
• Aggregation tools
  • Ganglia
  • Collectd
  • Performance Co-Pilot (PCP)
  • Nagios
  • Unified Fabric Manager (UFM)
  • Cacti
  • Syslog
  • Round Robin Database (RRD)
Metrics need to be gathered from all over the cluster into a single place for analysis and storage.
Most metrics should travel over the management Ethernet so that they do not interfere with job performance on the low-latency interconnect.
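To make the idea of gathering metrics into a single place concrete, here is a minimal sketch of the aggregation step. It assumes each node has already sent a small report to a central host; the `cluster_summary` function, the node names and the report layout are hypothetical.

```python
from statistics import mean


def cluster_summary(expected_nodes, reports):
    """Combine per-node reports into one cluster-wide summary.

    expected_nodes: names of all nodes that should report.
    reports: dict mapping node name to its latest report, e.g. {"load1": 0.7}.
    """
    responding = [node for node in expected_nodes if node in reports]
    missing = [node for node in expected_nodes if node not in reports]
    loads = [reports[node]["load1"] for node in responding]
    return {
        "expected": len(expected_nodes),
        "responding": len(responding),
        "missing": missing,
        "mean_load1": mean(loads) if loads else None,
    }


# Made-up example: n2 did not report in this interval.
print(cluster_summary(["n1", "n2", "n3"], {"n1": {"load1": 1.0}, "n3": {"load1": 3.0}}))
```

A node that fails to report is itself a useful metric: it feeds the "number of responding nodes" figure directly.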
Metric Analysis and Transformation
• Monitoring conundrum
  • Data is useless unless we do something with it
  • We can collect much more data than we can analyse
  • We generally won’t know what data we need until we need it
    • Exception: data we must provide for SLA requirements
  • Limited storage and processing capacity for metric analysis
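One common way to live with limited storage is to keep recent data at full resolution and consolidate older data into coarser averages, which is the general idea behind round-robin databases. The sketch below shows only the averaging step; the `downsample` function and the sample data are illustrative and do not reproduce any particular tool's implementation.

```python
def downsample(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) sorted by time.
    """
    buckets = {}
    for timestamp, value in samples:
        start = timestamp - timestamp % bucket_seconds
        buckets.setdefault(start, []).append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


# Made-up one-minute samples reduced to five-minute averages.
raw = [(0, 1.0), (60, 3.0), (300, 5.0), (360, 7.0)]
print(downsample(raw, 300))
```

Averaging hides short spikes, so a maximum per bucket may be worth keeping alongside the mean for metrics where peaks matter.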
Notification
• Notification tools
  • Nagios
  • Icinga
  • Zenoss
  • Zabbix
  • PRTG
  • OpenNMS
  • OP5
  • Pandora FMS
  • Unified Fabric Manager (UFM)
Basic functionality of all alerts: Red, Yellow, Green
Several notification tools, such as Icinga and OP5, began as forks or derivatives of Nagios
Notification tools can be passive or active in querying the status of the cluster
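The red, yellow, green model comes down to comparing a metric against two thresholds. Here is a minimal sketch, assuming higher values are worse; the function names and the threshold values are hypothetical.

```python
SEVERITY = {"green": 0, "yellow": 1, "red": 2}


def status_color(value, warn, crit):
    """Map a metric value to green, yellow or red (higher value = worse)."""
    if value >= crit:
        return "red"
    if value >= warn:
        return "yellow"
    return "green"


def overall(colors):
    """The overall status is the worst individual status."""
    return max(colors, key=SEVERITY.__getitem__, default="green")


# Made-up disk usage percentages checked against 80% and 95% thresholds.
colors = [status_color(pct, warn=80, crit=95) for pct in (42, 85, 97)]
print(colors, overall(colors))
```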
Notification
• Monitoring for known evil
  • Basis for all notifications
  • Only alert if something known bad happens
• Metrics -> Notifications
  • Most tools will require extensive configuration to be useful
  • Most tools will have a way to query metrics and create alerts
    • Some tools, such as Nagios, have this entire process built in
    • Others will have ways to bolt on this functionality
      • Nagios can query Ganglia
      • Ganglia can query Nagios
How should we get notified?
• Emergency
  • Fire and smoke exiting machine
• Urgent:
  • Email or text or phone call
  • Define this carefully
• Not-so-urgent:
  • Web page updates
    • Especially helpful for historical data
  • Email (filtered)
  • End-user support requests
SLA-based Alerts
• Alerts on deliverables
  • Availability of services (uptime)
    • Example: Alert if fewer than 98% of batch nodes are online
  • Mean time between failures (MTBF)
    • Example: Send an email report of time between failures
  • Mean time to repair or mean time to recovery (MTTR)
    • Example: Alert if a down node does not come back online within 4 hours
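The availability and MTTR examples above can be expressed as two simple checks. This is a minimal sketch: the 98% and 4 hour figures come from the slide, while the `sla_alerts` function, the data layout and the node names are assumptions made for illustration.

```python
def sla_alerts(nodes, now, min_online_fraction=0.98, max_down_hours=4.0):
    """Return alert messages for the availability and repair-time examples.

    nodes: dict mapping node name to None if online, or to the time (in hours)
           at which the node went down.
    now:   current time in the same hour-based units.
    """
    if not nodes:
        return []
    alerts = []
    online = sum(1 for down_since in nodes.values() if down_since is None)
    fraction = online / len(nodes)
    if fraction < min_online_fraction:
        alerts.append(f"availability: only {fraction:.1%} of batch nodes online")
    for name, down_since in sorted(nodes.items()):
        if down_since is not None and now - down_since > max_down_hours:
            alerts.append(f"repair: {name} down for {now - down_since:.1f} h")
    return alerts


# Made-up cluster of 100 nodes with three nodes down at hour 10.
nodes = {f"node{i:03d}": None for i in range(100)}
nodes.update({"node000": 1.0, "node001": 8.0, "node002": 9.5})
for message in sla_alerts(nodes, now=10.0):
    print(message)
```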
How often to alert?
• SLA requirements
  • If your SLA requires it, you may need to be called off-hours or even on holidays
• You will quickly get a feel for this
  • Too much info is often worse than too little info
  • The “urgent”: continually
  • The “not-so-urgent”: anywhere from a few times per day to once per week
• There’s nothing wrong with trial and error
• Consider aggregated reports for “not-so-urgent”
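One way to act on this is to deliver urgent alerts immediately and hold everything else for a periodic aggregated report. The sketch below assumes that split; the `AlertRouter` class, the severity labels and the message texts are hypothetical, and actual delivery (email, text, phone) is replaced by appending to a list.

```python
from collections import Counter


class AlertRouter:
    """Send urgent alerts right away; batch the rest into a digest."""

    def __init__(self):
        self.sent = []     # stand-in for email, text or phone delivery
        self.pending = []  # not-so-urgent messages waiting for the next digest

    def handle(self, severity, message):
        if severity == "urgent":
            self.sent.append(f"PAGE: {message}")
        else:
            self.pending.append(message)

    def flush_digest(self):
        """Build one aggregated report; call this from a daily or weekly timer."""
        if not self.pending:
            return None
        counts = Counter(self.pending)
        lines = [f"{count}x {message}" for message, count in sorted(counts.items())]
        self.pending.clear()
        digest = "Digest:\n" + "\n".join(lines)
        self.sent.append(digest)
        return digest


router = AlertRouter()
router.handle("urgent", "fileserver down")
router.handle("not-so-urgent", "disk 80% on node07")
router.handle("not-so-urgent", "disk 80% on node07")
router.handle("not-so-urgent", "fan warning on node12")
print(router.flush_digest())
```

Collapsing repeated messages into a count keeps the report short, which matters because too much information is often worse than too little.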
Security Alerts
• Securing the cluster
  • Security alerts may need to go to specific groups or people instead of normal operations
• Regulations and security rules may apply to the cluster and must be enforced
  • Compliance with regulations: Sarbanes-Oxley, FISMA, HIPAA, etc.
• Active response, such as blocking IPs, may be required
• Security status updates
• Alerts on security failures
  • sudo reports
  • Network login failures (e.g. fail2ban)
  • crontab failures
  • Logfile errors (customize to fit)
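As an example of alerting on network login failures, the sketch below counts failed SSH password attempts per source address and reports addresses that reach a threshold. It illustrates the idea only and is not how fail2ban itself is implemented; the sample log lines, the addresses (from documentation ranges) and the threshold of 3 are made up.

```python
import re
from collections import Counter

# Matches the usual OpenSSH "Failed password" message.
FAILED = re.compile(r"Failed password for (?:invalid user )?(\S+) from (\S+) port \d+")


def offenders(lines, threshold=3):
    """Return source addresses with at least `threshold` failed logins."""
    counts = Counter()
    for line in lines:
        match = FAILED.search(line)
        if match:
            counts[match.group(2)] += 1
    return sorted(ip for ip, count in counts.items() if count >= threshold)


# Made-up log excerpt.
LOG = [
    "May 24 10:15:01 login1 sshd[2001]: Failed password for root from 203.0.113.5 port 4242 ssh2",
    "May 24 10:15:03 login1 sshd[2001]: Failed password for invalid user admin from 203.0.113.5 port 4243 ssh2",
    "May 24 10:15:05 login1 sshd[2002]: Failed password for alice from 198.51.100.7 port 5151 ssh2",
    "May 24 10:15:07 login1 sshd[2003]: Failed password for root from 203.0.113.5 port 4244 ssh2",
    "May 24 10:15:09 login1 sshd[2004]: Accepted publickey for bob from 192.0.2.10 port 6000 ssh2",
]
print(offenders(LOG))
```

The returned list could feed either a security alert or an active response such as blocking the address.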
Log Management
• Centralized log management
  • Make troubleshooting easier
  • Analyze your logs
  • Reduce the risk of data loss
• ELK stack (https://www.elastic.co):
  • Logstash: Collects and parses logs
  • Elasticsearch: Stores and indexes logs
  • Kibana: Visualization and analysis
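The "collects and parses" step turns free-form log lines into structured records that can be indexed and searched. The sketch below shows that idea in plain Python for a traditional syslog line; it is not Logstash configuration, and the field names and sample lines are assumptions.

```python
import json
import re

# Traditional syslog layout: "May 24 10:15:02 host program[pid]: message"
SYSLOG = re.compile(
    r"^(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<program>[^\[:\s]+)(?:\[(?P<pid>\d+)\])?: "
    r"(?P<message>.*)$"
)


def parse_syslog(line):
    """Return a dict of fields, or None if the line does not match."""
    match = SYSLOG.match(line)
    return match.groupdict() if match else None


# Made-up example lines.
record = parse_syslog("May 24 10:15:02 node01 sshd[1234]: Accepted publickey for alice")
print(json.dumps(record, indent=2))
print(parse_syslog("May 24 10:15:03 node02 kernel: Out of memory"))
```

Once lines are structured like this, questions such as "all kernel messages from node02 in the last hour" become simple field queries.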
Example: Nagios
• Nagios/NRPE (Nagios Remote Plugin Executor)
  • Generic executable that runs “plugins”
  • Plugins can monitor just about anything you can think of monitoring
  • Even works with Windows
• Nagios (http://www.nagios.org/) is by far the most common monitoring system
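A plugin is an ordinary program that prints one status line and reports its result through its exit code: 0 for OK, 1 for WARNING, 2 for CRITICAL and 3 for UNKNOWN. Below is a minimal sketch of a load check written to that convention; the check itself, its function names and its default thresholds are hypothetical and are not a plugin shipped with Nagios.

```python
import os

# Exit codes defined by the Nagios plugin convention.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def check_load(load1, warn, crit):
    """Return (exit_code, status_line) for a 1-minute load average."""
    # Text after "|" is performance data: label=value;warn;crit
    perfdata = f"load1={load1:.2f};{warn};{crit}"
    if load1 >= crit:
        return CRITICAL, f"LOAD CRITICAL - load1 is {load1:.2f} | {perfdata}"
    if load1 >= warn:
        return WARNING, f"LOAD WARNING - load1 is {load1:.2f} | {perfdata}"
    return OK, f"LOAD OK - load1 is {load1:.2f} | {perfdata}"


def run_plugin(warn=4.0, crit=8.0):
    """Read the live load average, print the status line, return the exit code."""
    try:
        load1 = os.getloadavg()[0]
    except (OSError, AttributeError):
        print("LOAD UNKNOWN - load average not available")
        return UNKNOWN
    code, text = check_load(load1, warn, crit)
    print(text)
    return code


print(check_load(0.5, 4.0, 8.0))
```

When used as a real plugin, the script would end with `sys.exit(run_plugin())` so that the monitoring server sees the exit code.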
Example: Check_MK
Image source: https://mathias-kettner.de/bilder/2.png
• Check_MK (http://mathias-kettner.com/check_mk.html)
  • Much faster than Nagios
  • Graphing tools and web GUI
  • Agent-side and server-side checks
  • Easy to install and configure with OMD
  • Used by CERN (and HCC)
Example: Ganglia
• Ganglia (http://ganglia.sourceforge.net/) - for historical and resource monitoring
  • Ours are public
  • RRD files give historical data (a.k.a. “lots of pretty graphs”)
Example: Grafana
• Grafana (https://grafana.com) – General-purpose dashboard
• InfluxDB (https://www.influxdata.com) – Time series database
• Collectd (https://collectd.org) – Collects metrics and sends them to InfluxDB
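InfluxDB accepts points in a text format called line protocol: a measurement name, optional tags, one or more fields and a timestamp. The sketch below formats a single point that way. It is a simplified illustration that handles only numeric fields and does no escaping of special characters; the measurement, tag and field names are made up, and in the stack above collectd does the sending for you.

```python
def to_line_protocol(measurement, tags, fields, timestamp_ns):
    """Format one point as: measurement,tag=value field=value timestamp

    Simplified: fields are written as floats, and spaces or commas in names
    and values are NOT escaped.
    """
    tag_part = "".join(f",{key}={value}" for key, value in sorted(tags.items()))
    field_part = ",".join(f"{key}={float(value)}" for key, value in sorted(fields.items()))
    return f"{measurement}{tag_part} {field_part} {timestamp_ns}"


# Made-up sample point with a nanosecond timestamp.
line = to_line_protocol("load", {"host": "node01"}, {"shortterm": 0.64}, 1495584000000000000)
print(line)
```

Tags such as the host name are what a Grafana dashboard typically filters and groups by.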
Example: ELK stack
Image source: http://blog.tarams.com/index.php/2015/elk-stack-search-and-analytics-platform/#sthash.87qX4g92.dpbs
Monitoring Future
• Large data analysis using machine learning
  • Noise reduction
  • Anomaly detection
  • Correlation
• Examples:
  • Moogsoft (https://www.moogsoft.com)
  • Metricly (https://www.metricly.com/product)
  • Anodot (https://www.anodot.com)
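To give a feel for what anomaly detection means, here is a deliberately simple statistical stand-in: flag any sample that lies more than a chosen number of standard deviations from the mean. This is not machine learning and is not how the products listed above work; the function name, the threshold of 3 and the sample data are assumptions.

```python
from statistics import mean, pstdev


def anomalies(values, threshold=3.0):
    """Return the indexes of values whose z-score exceeds the threshold."""
    if not values:
        return []
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [i for i, v in enumerate(values) if abs(v - mu) / sigma > threshold]


# Made-up temperature readings: steady at 10 with one spike at the end.
readings = [10.0] * 20 + [50.0]
print(anomalies(readings))
```

Unlike a fixed threshold, this check adapts to whatever is normal for each metric, which is the property the machine learning tools push much further.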