GridView - A Monitoring & Visualization tool for LCG
Rajesh Kalmady, Phool Chand, Kislay Bhatt, D. D. Sonvane, Kumar Vaibhav
B.A.R.C.
BARC-CERN/LCG Meeting 15.09.2006
GridView : New Developments (during 27th April to 15th September)
• Enhancements to GridFTP file transfer monitoring
• Development of summarization and presentation modules for
– Job Monitoring
– Service Availability Monitoring
• Deployment of all the new developments to production system
File Transfer Monitoring
• Enhanced GridFTP summarization and presentation modules for
– VO-wise distribution of overall data transfers
– VO-wise distribution of data transfers per Site
– Site-wise distribution of data transfers per VO
• Developed graphs and reports for data transfers from all sites to a given site (Hourly, Daily reports)
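The VO-wise and site-wise summaries listed above can be illustrated with a minimal sketch. It assumes each transfer record is a hypothetical tuple `(vo, source_site, destination_site, bytes)`; the record layout and the function name are illustrative only and are not taken from the GridView code.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical record layout: (vo, source_site, destination_site, bytes)
Transfer = Tuple[str, str, str, int]


def summarize_transfers(transfers: Iterable[Transfer]):
    """Return (bytes per VO, bytes per destination site per VO)."""
    per_vo: Dict[str, int] = defaultdict(int)
    per_site_vo: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for vo, _src, dst, nbytes in transfers:
        per_vo[vo] += nbytes            # VO-wise distribution of overall transfers
        per_site_vo[dst][vo] += nbytes  # VO-wise distribution per destination site
    # Convert nested defaultdicts to plain dicts for reporting
    return dict(per_vo), {site: dict(vos) for site, vos in per_site_vo.items()}
```

Grouping by destination site corresponds to the "all sites to a given site" report; swapping the two keys would give the site-wise distribution per VO.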
File Transfer Monitoring : Overall VO-wise Details
File Transfer Monitoring : Site-wise details for a particular VO
Job Monitoring
• Developed summarization module for computation of job statistics
• Developed presentation module to display periodic Graphs and Reports for
– Job Status (Total Number of Jobs in various States)
– Job Success Rate
– Job Resource Utilization (Elapsed time, CPU, Memory)
– Average Job Turnaround time (RB Waiting, Site Waiting, Execution Time)
– Site, VO and RB-wise distribution
– Hourly, Daily, Weekly and Monthly reports
Job Monitoring (Cont…)
• Developed periodic Graphs and Reports for
– Overall Summary
• sites with high/low job execution rate
• sites with high/low job success rate
• VOs running more/less jobs etc
– Possible to view job statistics for any user selected combination of VO, Site and RB
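The job statistics described on these slides can be sketched as follows. This is an illustration under assumptions, not the GridView summarization module: the `JobRecord` fields are hypothetical, and turnaround time is assumed to be the sum of RB waiting, site waiting and execution time.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobRecord:
    # Hypothetical fields, chosen to mirror the quantities named on the slides
    vo: str
    site: str
    rb: str
    succeeded: bool
    rb_wait_s: float
    site_wait_s: float
    exec_s: float


def select(jobs: List[JobRecord], vo: Optional[str] = None,
           site: Optional[str] = None, rb: Optional[str] = None) -> List[JobRecord]:
    """Filter jobs by any user-selected combination of VO, Site and RB."""
    return [j for j in jobs
            if (vo is None or j.vo == vo)
            and (site is None or j.site == site)
            and (rb is None or j.rb == rb)]


def success_rate(jobs: List[JobRecord]) -> float:
    """Fraction of jobs that succeeded (0.0 for an empty selection)."""
    return sum(j.succeeded for j in jobs) / len(jobs) if jobs else 0.0


def average_turnaround(jobs: List[JobRecord]) -> float:
    """Mean of RB waiting + site waiting + execution time (assumed definition)."""
    if not jobs:
        return 0.0
    return sum(j.rb_wait_s + j.site_wait_s + j.exec_s for j in jobs) / len(jobs)
```

For example, `success_rate(select(jobs, vo="atlas", site="CERN"))` would give the success rate for one VO at one site.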
Job Status : State-wise Distribution
Job Status : VO-wise Distribution
Job Status : RB-wise Distribution
Job Status : Site-wise Distribution
Job Monitoring : Job Success Rate
Job Monitoring : Average Job Turnaround time
Service Availability Monitoring
• Developed summarization module for computation of Service Availability
– based on SAM Test Results
– AND (critical services) of OR (redundant services)
• Developed presentation module to display periodic Graphs and Reports for
– Central Service Availability (FTS, LFC, RB)
– Aggregate tier-1 site Availability
– Site-wise availability for individual tier-1 sites
– Site-wise service availability of tier-2 sites (grouped by associated VOs)
– Detailed availability of various services (CE, SE, SRM) and their individual instances running at a particular site
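The "AND of OR" rule above can be sketched as follows. This is a minimal illustration, assuming each service maps to a list of pass/fail results for its redundant instances in one sampling period; the data layout and function names are hypothetical, not the GridView implementation.

```python
from typing import Dict, List


def service_available(instances: List[bool]) -> bool:
    """OR over redundant instances: the service is up if any instance passed."""
    return any(instances)


def site_available(critical_services: Dict[str, List[bool]]) -> bool:
    """AND over critical services, each evaluated as an OR over its instances."""
    return all(service_available(inst) for inst in critical_services.values())


def availability_fraction(samples: List[Dict[str, List[bool]]]) -> float:
    """Fraction of sampling periods in which the site was available."""
    if not samples:
        return 0.0
    return sum(site_available(s) for s in samples) / len(samples)
```

In this sketch a site with two CE instances stays available while at least one CE passes its tests, but a single failing service type (for example all SRM instances down) marks the whole site unavailable for that period.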
Service Availability Monitoring (Cont…)
• Reports on Hourly, Daily, Weekly and Monthly basis
• Traceability from Aggregate Availability to Individual Service Instance Availability
• Provision for saving user preferences based on certificates
Service Availability Monitoring : Central Service Availability
Service Availability Monitoring : FTS Instance Availability
Service Availability Monitoring : Aggregate T1 Site Availability
Service Availability Monitoring : Tier-1 Site Availability
Service Availability Monitoring : Site Detail Availability
On-going Work
• Presentation of Detailed SAM test results for traceability from Availability Graphs to corresponding tests
• Development of Weekly and Monthly reports for All to Given site data transfers
• Modification of GridFTP file transfer GUI and Reports to enable Multiple site selection (new request)
Future Work
• Visualization of FTS Statistics
• Archival of Job data for jobs submitted directly to CE
• Interfacing GridView with Information System (Top level BDII) for Resource Availability
– Compute nodes (WNs), Storage etc
Future Work : Visualization of FTS Statistics
• Currently GridView visualizes GridFTP data transfer rates across the sites.
• FTS statistics to be visualized include
– Successful transfers
– Failure rates
– VO-wise, FTS server-wise and Channel-wise details of data transfers
Problems
• No data has been published to the R-GMA table JobMonitor for the past 2 months (in spite of repeated reminders)
• GridView Availability Depends on
– R-GMA Service
– Oracle Database Service
– SAM/SFT tests
• Instabilities in GridView service caused by
– R-GMA Instabilities
• Registry failures, Monbox failures, Data loss etc.
– Occasional Oracle downtime
– Unannounced software upgrades on production machines leading to broken code
• Subsequently, GridView address added to cern-quattor-announce mailing list and upgrades done manually by GridView team
Thank You
Your comments and suggestions please