Fabric Management for CERN Experiments Past, Present, and Future
GridPP7 – June 30 – July 2, 2003 – Fabric monitoring– n° 1
Fabric monitoring for LCG-1 in the CERN Computer Center
Jan van Eldik
CERN-IT/FIO/SM
7th GridPP Collaboration meeting
July 1, 2003
Outline
• Fabric monitoring developments at CERN
• Architectural overview
• Deployment: status & plans for LCG-1
• Outlook
Fabric Monitoring at CERN
• Improved fabric management is a key part of the LCG programme
• EDG WP4 develops tools for automated installation, configuration, fabric monitoring, fault tolerance
• IT/FIO Supervision & Monitoring section: develop and deploy a monitoring solution for the LHC era
• A lot of expertise: EDG WP4 monitoring developments, PVSS SCADA studies, SNMP studies, operator alarm displays, …
• Architecture based on functional requirements gathered by the PEM project
• Important objective: fabric monitoring for LCG-1 at CERN
Requirements and architecture
[Architecture diagram: on the monitored nodes, Sensors feed a Monitoring Sensor Agent with a Cache and Local Consumers; the agent forwards data to a Measurement Repository with a Database and Global Consumers.]
• Both for performance and exception monitoring
• Local and global consumers
• Scalable, extensible, robust
EDG WP4 implementation
[Architecture diagram with WP4 component names: monitored nodes running Sensors, a Monitoring Sensor Agent (MSA), a Cache and Local Consumers; a Measurement Repository (MR) with a Database and Global Consumers.]
Monitoring Sensor Agent
• Calls plug-in sensors to sample configured metrics
• Stores all collected data in a local disk buffer
• Sends the collected data to the global repository
Plug-in sensors
• Programs/scripts that implement a simple sensor-agent ASCII text protocol
• A C++ interface class is provided on top of the text protocol to facilitate implementation of new sensors
The local cache
• Ensures data is still collected when the node cannot connect to the network
• Allows for node autonomy for local repairs
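The interplay between agent, sensors and local cache can be sketched as follows. This is a minimal illustration in Python, not the WP4 code: the class name `SensorAgent`, the JSON-lines buffer format and the `send` callback are all assumptions made for the example, and the real sensor-agent ASCII protocol is not modelled.

```python
import json
import os
import time


class SensorAgent:
    """Toy agent: samples plug-in sensors, buffers on disk, forwards when it can."""

    def __init__(self, sensors, buffer_path, send):
        self.sensors = sensors          # metric name -> zero-argument callable (hypothetical)
        self.buffer_path = buffer_path  # local disk buffer, one JSON record per line
        self.send = send                # callable(record); raises OSError if unreachable
        self._forwarded = 0             # number of buffered records already sent

    def sample(self, now=None):
        """Call every sensor once and append the readings to the local buffer."""
        now = time.time() if now is None else now
        with open(self.buffer_path, "a", encoding="utf-8") as buf:
            for metric, read in self.sensors.items():
                record = {"metric": metric, "time": now, "value": read()}
                buf.write(json.dumps(record) + "\n")

    def read_buffer(self):
        """Return everything collected so far, e.g. for a local consumer."""
        if not os.path.exists(self.buffer_path):
            return []
        with open(self.buffer_path, encoding="utf-8") as buf:
            return [json.loads(line) for line in buf if line.strip()]

    def flush(self):
        """Forward records not yet sent; stop at the first network failure."""
        sent = 0
        for record in self.read_buffer()[self._forwarded:]:
            try:
                self.send(record)
            except OSError:
                break  # repository unreachable: keep the data and retry later
            sent += 1
        self._forwarded += sent
        return sent
```

Sampling keeps working while `send` fails, and a later `flush()` delivers the backlog in order. The buffer is never truncated here, so local consumers can still read it; a real agent would need some retention policy.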
Transport
• Transport is pluggable
• Two protocols are currently supported, one over UDP and one over TCP; only the TCP-based one can guarantee delivery
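A pluggable transport of this kind might look like the sketch below. The class names, the registry and the newline framing are assumptions for illustration; the actual WP4 wire protocols are not described on the slide.

```python
import socket


class UdpTransport:
    """Datagram transport: cheap, but a lost packet goes unnoticed."""

    def __init__(self, host, port):
        self.addr = (host, port)
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    def send(self, payload: bytes) -> None:
        self.sock.sendto(payload, self.addr)

    def close(self) -> None:
        self.sock.close()


class TcpTransport:
    """Stream transport: the connection reports failures to the sender."""

    def __init__(self, host, port, timeout=5.0):
        self.sock = socket.create_connection((host, port), timeout=timeout)

    def send(self, payload: bytes) -> None:
        # Newline framing is an assumption of this sketch.
        self.sock.sendall(payload + b"\n")

    def close(self) -> None:
        self.sock.close()


# Hypothetical registry: adding a transport means adding one entry here.
TRANSPORTS = {"udp": UdpTransport, "tcp": TcpTransport}


def make_transport(kind, host, port):
    """Look up a transport by its configured name."""
    try:
        factory = TRANSPORTS[kind]
    except KeyError:
        raise ValueError(f"unknown transport: {kind!r}") from None
    return factory(host, port)
```

The agent only depends on `send(payload)`, so the choice between the two becomes a configuration matter. With UDP the sender never learns about a dropped datagram, whereas a failed TCP send raises an error the agent can act on.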
Measurement Repository
• The data is stored in a database
• A memory cache guarantees fast access to the most recent data, which is normally what is used for fault tolerance correlations
Repository API
• SOAP RPC
• Query history data
• Subscription to new data
Database
• Proprietary flat-file database
• Oracle
• Open source interface to be developed
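The repository's three access patterns (most recent value from a memory cache, history queries, subscription to new data) can be illustrated with an in-memory stand-in. Everything here is a hypothetical sketch: the class and method names are invented, plain Python lists replace the database backend, and the SOAP RPC layer is left out entirely.

```python
from collections import defaultdict


class MeasurementRepository:
    """In-memory stand-in for a repository with a most-recent-value cache."""

    def __init__(self):
        self._history = defaultdict(list)      # (node, metric) -> [(time, value), ...]
        self._latest = {}                      # (node, metric) -> (time, value)
        self._subscribers = defaultdict(list)  # metric -> list of callbacks

    def store(self, node, metric, timestamp, value):
        """Record a sample, refresh the cache and notify subscribers."""
        key = (node, metric)
        self._history[key].append((timestamp, value))
        newest = self._latest.get(key)
        if newest is None or timestamp >= newest[0]:
            self._latest[key] = (timestamp, value)
        for callback in self._subscribers[metric]:
            callback(node, metric, timestamp, value)

    def latest(self, node, metric):
        """Fast path: most recent sample, or None if nothing was stored."""
        return self._latest.get((node, metric))

    def history(self, node, metric, start, end):
        """Samples with start <= time <= end, oldest first."""
        samples = self._history.get((node, metric), [])
        return sorted(s for s in samples if start <= s[0] <= end)

    def subscribe(self, metric, callback):
        """Register callback(node, metric, timestamp, value) for new data."""
        self._subscribers[metric].append(callback)
```

A fault tolerance component would typically call `latest()` or `subscribe()`, while performance analysis would use `history()` over a time window.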
Deployment status in the CERN CC
• MSA with sensors for performance and exception monitoring, measuring 100-150 quantities per box
• Deployed on ~1500 RedHat Linux nodes
• 30 clusters, with specific configuration files
Batch 1000 nodes
Interactive 70 nodes
Disk server 200 nodes
Tape server 80 nodes
WWW, DB, MISC 200 nodes
Status of exception monitoring
• ~50 possible alarms per monitored node
  HighLoad, DaemonDead, FileSysFull, install / config problems
• Operator alarm displays
  – PVSS-based, developed as part of PVSS-tests
  – WP4 alarm display under active development
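As an illustration of exception monitoring, the sketch below turns one node sample into alarm names of the kind listed above. The sample layout, the thresholds and the `name:detail` suffix are assumptions chosen for the example, not the actual sensor definitions.

```python
def evaluate_alarms(sample, max_load=10.0, max_disk_fraction=0.95):
    """Return the alarms raised by one node sample (thresholds are hypothetical)."""
    alarms = []
    if sample["load"] > max_load:
        alarms.append("HighLoad")
    for name, running in sorted(sample["daemons"].items()):
        if not running:
            alarms.append(f"DaemonDead:{name}")
    for mount, used in sorted(sample["filesystems"].items()):
        if used >= max_disk_fraction:
            alarms.append(f"FileSysFull:{mount}")
    return alarms


sample = {
    "load": 12.0,
    "daemons": {"sshd": True, "crond": False},
    "filesystems": {"/": 0.50, "/tmp": 0.97},
}
print(evaluate_alarms(sample))
# ['HighLoad', 'DaemonDead:crond', 'FileSysFull:/tmp']
```

An operator display would then only need to render the returned list per node.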
PVSS operator alarm display
WP4 operator alarm display
Performance monitoring
• WP4 Measurement Repository with Oracle backend is currently being deployed in the CERN CC for LCG-1
• Data access
  – C API to the repository is available; Perl and Java implementations to be done
  – Simple CLI is being delivered
  – GUI is being delivered
Anamon
Open issues
• Current solution is still very node-centric
• Not much experience with consumers
• No correlation engines, no corrective actions yet…
• Integration with configuration system to be done
Summary and Outlook
• Fabric monitoring infrastructure for LCG-1 at CERN is being deployed
• Monitoring Sensor Agent has been operating very well
• Measurement Repository will now be challenged
• Consumers can start consuming…
• An interesting six-month period awaits us!