Computing Facilities
CERN IT Department
CH-1211 Geneva 23
Switzerland
www.cern.ch/it
CF
Alarming with GNI
VOC WG meeting
12th September 2013
Agenda
• GNI Overview
• Metrics Manager
  – Metric Registration
  – Metric Workflow
  – Quattor Legacy
• Lemon Producer
• GNI Consumers
  – Service Now Integration
  – GNI Dashboard
  – No Contact Processor
• Current Status and Next Steps
GNI Overview
Architecture
Metrics Manager
Metric Registration
• Lemon Metric Manager: https://metricmgr.cern.ch
• Single entry point for Quattor & Puppet metrics configuration
• Keeps default parameter settings and assigns responsibility
  – Metric parameter overriding available via Puppet
• Lemon metrics concept:
  – A sensor implements multiple metric class definitions
  – A metric class can be used for multiple metric definitions
[Diagram labels: Metric Manager, Puppet, Hiera, configuration files, node, Lemon Agent, Lemon Forwarder]
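The sensor / metric class / metric relationship described on this slide can be sketched as a small data model. This is only an illustration: the class names, the parameter names (`period_s`, `threshold`), the metric IDs and the sensor name are all hypothetical and are not taken from Lemon or the Metric Manager.

```python
from dataclasses import dataclass, field


@dataclass
class MetricClass:
    """A metric class with its default parameters (hypothetical fields)."""
    name: str
    default_params: dict = field(default_factory=dict)


@dataclass
class Sensor:
    """A sensor implements several metric classes."""
    name: str
    metric_classes: dict = field(default_factory=dict)

    def add_class(self, metric_class):
        self.metric_classes[metric_class.name] = metric_class


@dataclass
class Metric:
    """A concrete metric: one metric class plus optional overrides."""
    metric_id: int
    metric_class: MetricClass
    overrides: dict = field(default_factory=dict)

    def params(self):
        # Defaults come from the class; overrides win on conflict.
        merged = dict(self.metric_class.default_params)
        merged.update(self.overrides)
        return merged


# One sensor, two metric classes, two metrics sharing one class.
sensor = Sensor("example_sensor")
usage = MetricClass("example.Usage", {"period_s": 300, "threshold": 90})
sensor.add_class(usage)
sensor.add_class(MetricClass("example.Count", {"period_s": 60}))

m1 = Metric(90001, usage)                     # keeps the defaults
m2 = Metric(90002, usage, {"threshold": 75})  # overrides one parameter
```

Both metrics reuse the same metric class, and only `m2` changes a default, which mirrors the idea of central defaults with per-metric overriding.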
Metric Workflow
• Supports puppet only and puppet + quattor metrics
• New metrics:
  – Draft: user defines metric
  – Pending: user submits metric for approval, itmon team verifies
  – Production: itmon team propagates new metric to agent definitions
• Metrics already in Quattor:
  – Legacy: metric was imported from Quattor but is not enabled in Puppet
  – Production: itmon team propagates metric to lemon agent definitions
• Changes to production metrics:
  – Production: user changes metric definition
  – Production: itmon team propagates metric to lemon agent definitions
• Further details: https://metricmgr.cern.ch/help/
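The workflow states listed on this slide can be read as a small state machine. The sketch below is an assumption about how the transitions fit together; the action names (`submit`, `propagate`, `change`) are hypothetical labels for the steps described, not Metric Manager API calls.

```python
from enum import Enum


class State(Enum):
    DRAFT = "draft"
    PENDING = "pending"
    LEGACY = "legacy"
    PRODUCTION = "production"


# (current state, action) -> next state. Action names are illustrative only.
TRANSITIONS = {
    (State.DRAFT, "submit"): State.PENDING,            # user submits for approval
    (State.PENDING, "propagate"): State.PRODUCTION,    # itmon verifies and propagates
    (State.LEGACY, "propagate"): State.PRODUCTION,     # imported Quattor metric enabled
    (State.PRODUCTION, "change"): State.PRODUCTION,    # user edits the definition
    (State.PRODUCTION, "propagate"): State.PRODUCTION, # itmon propagates the edit
}


def advance(state, action):
    """Return the next state, or raise ValueError for a disallowed action."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(
            f"action {action!r} not allowed in state {state.value!r}"
        ) from None


state = advance(State.DRAFT, "submit")
state = advance(state, "propagate")
```

Keeping the transitions in one table makes it easy to see that a draft cannot reach production without passing through the pending (approval) step.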
Quattor Legacy
• Metric definition must still be added to Quattor
  – Copy the generated Quattor code into a CDB template
  – e.g. under prod/pro_monitoring_*.tpl
Lemon Producer
• Main components:
  – Lemon agent and sensors: no changes
  – Lemon forwarder: wraps lemon data in JSON format
  – Lemon tools: no changes to lemon-host-check and lemon-cli
• Notifications are sent based on lemon exceptions (alarms)
• Notifications can be customized in the node:
  – Can be configured via puppet (How-to)
  – Overwrites defaults in metrics manager
• Users can create other notifications
[Diagram labels: Metric Manager, Puppet, Hiera, configuration files, node, Lemon Agent, Lemon Forwarder]
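As a rough illustration of the forwarder step, the sketch below wraps an exception into a JSON message and lets node-level settings overwrite central defaults. The JSON field names (`host`, `exception`, `settings`), the setting keys and the host name are all assumptions; the actual message schema used by the Lemon forwarder is not given in these slides.

```python
import json


def build_notification(host, exception, defaults, node_overrides=None):
    """Wrap an exception into a JSON string (hypothetical schema).

    Node-level settings overwrite the central defaults key by key.
    """
    settings = dict(defaults)
    settings.update(node_overrides or {})
    message = {"host": host, "exception": exception, "settings": settings}
    return json.dumps(message, sort_keys=True)


# Central defaults, with one value overwritten for a single node.
defaults = {"create_incident": True, "type": "os"}
payload = build_notification(
    "node01.example.org",
    "example_exception",
    defaults,
    node_overrides={"create_incident": False},
)
decoded = json.loads(payload)
```

Copying the defaults before updating keeps the central settings untouched, so an override on one node never leaks to another.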
GNI Consumers
Service Now Integration
• Takes notifications marked for incident creation
• Checks if notification should be masked
• Opens incidents in SNOW
• Re-submits notification with incident ID
• Supports masking of ticket creation
• Today takes the alarmed flag defined in Foreman
  – Requires a successful puppet run
• In the future it will be integrated with Roger
  – Developed by config team
  – Prototyping phase
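The four consumer steps listed on this slide can be sketched as one function. Everything here is hypothetical: the notification keys, the callables `is_masked`, `open_incident` and `resubmit`, and the incident ID format stand in for the real Foreman, Service Now and GNI interfaces, which the slides do not describe.

```python
def process(notification, is_masked, open_incident, resubmit):
    """Handle one notification; return the incident ID or None.

    The three callables are placeholders for the masking lookup,
    the ticketing call, and the re-submission to GNI.
    """
    # 1. Only notifications marked for incident creation are taken.
    if not notification.get("create_incident"):
        return None
    # 2. Skip masked hosts.
    if is_masked(notification["host"]):
        return None
    # 3. Open the incident.
    incident_id = open_incident(notification)
    # 4. Re-submit the notification, now carrying the incident ID.
    resubmit({**notification, "incident_id": incident_id})
    return incident_id


# Usage with in-memory stand-ins for the external services.
masked_hosts = {"node02.example.org"}
resubmitted = []

result_ok = process(
    {"host": "node01.example.org", "create_incident": True},
    is_masked=lambda host: host in masked_hosts,
    open_incident=lambda n: "INC-EXAMPLE-1",
    resubmit=resubmitted.append,
)
result_masked = process(
    {"host": "node02.example.org", "create_incident": True},
    is_masked=lambda host: host in masked_hosts,
    open_incident=lambda n: "INC-EXAMPLE-2",
    resubmit=resubmitted.append,
)
```

Passing the external calls in as parameters keeps the flow testable without a ticketing system behind it.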
Integration with Roger
• Masking in Roger
  – Service providing information about host state and masking state
  – Set masking for no contact notifications and 3 notification types:
    • Hardware, OS, Application
• All exceptions must be classified under a notification type:
  – Hardware, OS, Application
• FE responsibles will be asked to classify their exceptions
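A minimal sketch of per-type masking, assuming the masking state for a host can be represented as a mapping from notification kind to a boolean. The key names and the function are hypothetical; Roger's actual interface is not shown in these slides.

```python
# The three notification types named on the slide, plus no-contact.
NOTIFICATION_TYPES = ("hardware", "os", "application")
MASKABLE = NOTIFICATION_TYPES + ("no_contact",)


def should_notify(host_masks, kind):
    """Return True if a notification of this kind is not masked.

    host_masks is a hypothetical per-host mapping, e.g. {"os": True}.
    Unknown kinds are rejected, reflecting the requirement that every
    exception is classified under a notification type.
    """
    if kind not in MASKABLE:
        raise ValueError(f"unclassified notification kind: {kind!r}")
    return not host_masks.get(kind, False)


# A host under OS intervention: OS alarms masked, hardware alarms still active.
masks = {"os": True, "no_contact": True}
hardware_active = should_notify(masks, "hardware")
os_active = should_notify(masks, "os")
```

Rejecting unknown kinds instead of silently passing them through makes a missing classification visible early.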
GNI Dashboard
No Contact Processor
• Heartbeat from lemon metric updates
• Processor looks at heartbeat timeout
• Raises GNI notification
  – Creates SNOW incident for CC Operator
• If node comes back
  – Closes GNI notification
• Possible to mask with Roger
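The heartbeat logic described on this slide can be sketched as follows. The class, its method names and the 600-second timeout are assumptions for illustration; time is passed in explicitly so the example stays deterministic.

```python
class NoContactProcessor:
    """Track last heartbeat per host and flag hosts that went silent."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}   # host -> timestamp of last metric update
        self.open = set()     # hosts with an open no-contact notification

    def heartbeat(self, host, now):
        """Record a metric update; close the notification if one is open."""
        self.last_seen[host] = now
        if host in self.open:
            self.open.discard(host)
            return "closed"
        return None

    def check(self, now):
        """Return hosts whose heartbeat timed out since the last check."""
        raised = []
        for host, seen in self.last_seen.items():
            if now - seen > self.timeout_s and host not in self.open:
                self.open.add(host)
                raised.append(host)
        return raised


proc = NoContactProcessor(timeout_s=600)
proc.heartbeat("node01.example.org", now=0)
first = proc.check(now=300)    # still within the timeout
second = proc.check(now=601)   # timed out: notification raised
third = proc.check(now=700)    # already open: not raised again
back = proc.heartbeat("node01.example.org", now=800)  # node comes back
```

Tracking open notifications separately prevents the same silent host from raising a new notification on every check.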
Current Status & Next Steps
• Current status
  – Deployed dev and prod instances of GNI, including Metric Manager
  – Migrated from Apollo to ActiveMQ
  – Integrated with training instance of Service Now
• Next steps
  – Integrate Roger service for run-time notification type masking
  – Review default exception configuration
  – Start opening SNOW incidents for hardware notifications
  – Redirect production GNI instance to production Service Now
Questions?
http://cern.ch/itmon