IT-ES MONITORING APPLICATIONS
(TECHNOLOGY, SCALE, ISSUES)
Julia Andreeva
DASHBOARD APPLICATIONS
• Monitoring of the job processing (analysis, production), with real-time and historical views. Audience: users, operation teams, sites.
• Data management monitoring (data transfer, data access). Audience: operation teams, sites.
• Infrastructure monitoring (Site Status Board, Site usability, SiteView). Audience: sites, operation teams.
• Publicity & dissemination (WLCG Google Earth Dashboard). Audience: sites, general public.
COMMON SOLUTIONS
Application                         ATLAS  CMS  LHCb  ALICE
Job monitoring (multiple apps)        ✓     ✓
Site Status Board                     ✓     ✓     ✓     ✓
SUM                                   ✓     ✓     ✓     ✓
DDM Monitoring                        ✓
SiteView & GoogleEarth                ✓     ✓     ✓     ✓
A global WLCG transfer monitor based on the ATLAS DDM Dashboard is coming soon.
• All applications are shared by two, three, or four experiments.
• All applications are developed in a common framework, which includes common building blocks, a build and test environment, a common module structure, agent management, and a common repository.
DASHBOARD ARCHITECTURE
RECENT MODIFICATIONS
[Diagram: the Dashboard framework and external applications expose data in a machine-readable format (JSON), consumed by a client-side Ajax/JavaScript UI.]
• The UI is completely agnostic about the information source, which gives better flexibility: adding a new information source, or replacing an existing one, is a straightforward task.
• Clear decoupling of the development tasks.
USER INTERFACES
• Over the last months the Dashboard UIs were redesigned
• Client-side Model-View-Controller architecture
• Using jQuery and AJAX
• Full bookmarking support
• A lot of effort went into evaluating the design of large-scale JavaScript web applications and jQuery libraries. The experience is well documented, and recommendations for developers were set up:
https://twiki.cern.ch/twiki/bin/view/ArdaGrid/Libs
http://code.google.com/p/hbrowse/w/list
• A dedicated presentation could be of interest for the members of the monitoring group.
JOB MONITORING
• Provides information about data processing in the scope of a given VO.
• Mainly based on instrumentation of the job submission frameworks, and therefore works transparently across middleware platforms (OSG, ARC, gLite), submission methods (pilots, etc.), and execution backends (Grid, local).
• Merges information about a given job from multiple information sources (a unique job identifier is a requirement).
• Job monitoring applications are shared by ATLAS and CMS: the DB schema and user interfaces are shared (basically the same implementation, adapted for each experiment), while the information sources and transport mechanisms differ, so the collectors have separate implementations.
• Keeps track of all processing details at the single-job level.
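The merge step described above can be sketched as follows: records about the same job arrive from several sources and are combined on the unique job identifier. This is a minimal illustration; the field names (`job_id`, `site`, `status`, ...) are hypothetical, not the actual Dashboard schema.

```python
# Minimal sketch of merging job records from multiple monitoring sources,
# keyed on a unique job identifier (field names are illustrative).

def merge_job_records(*sources):
    """Combine per-job dictionaries from several monitoring sources.

    Later sources fill in fields the earlier ones did not provide;
    every record must carry the unique 'job_id' field.
    """
    merged = {}
    for source in sources:
        for record in source:
            entry = merged.setdefault(record["job_id"], {})
            for key, value in record.items():
                entry.setdefault(key, value)  # first writer wins
    return merged

# Example: a submission-time record and a worker-node status update
submitter = [{"job_id": "42", "site": "T2_CH_CERN", "submitted": "10:00"}]
worker = [{"job_id": "42", "status": "finished", "exit_code": 0}]
jobs = merge_job_records(submitter, worker)
```

Without the shared identifier the two fragments could not be correlated, which is why the unique job id is stated as a hard requirement.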
JOB MONITORING ARCHITECTURE
[Diagram: jobs running at the worker nodes and the job submission client or server publish messages to a message server (MonALISA or MSG); a Dashboard consumer writes them into the Dashboard data repository (one Oracle DB per experiment; for ATLAS also fed from the PanDA DB and the production DB). Data is retrieved via APIs by the Dashboard web server and the user web interfaces.]
JOB MONITORING DB
• Currently implemented in Oracle; the schema is normalized.
• The CMS schema is partitioned twice per month; the ATLAS schema is partitioned weekly.
• Some interfaces use pre-cooked aggregated data; others use raw data. Although a lot of tuning was done recently to improve the performance of the applications that use raw data, there is still room for improvement.
• We foresee trying NoSQL solutions as a cache for the UI.
• The main issue is occasional performance degradation due to instabilities of the execution plan; the situation may improve with the migration to Oracle 11g.
SOME NUMBERS
• ATLAS submits up to 800k jobs per day and CMS up to 300k, i.e. about 1 million jobs per day to follow, with regular updates of status changes per job.
• Per job, the DB contains time stamps of job status changes, meta information about the job, job status, error codes and error reasons, job processing metrics (CPU, wallclock, memory consumption, etc.), and the list of accessed files (kept for a short time only).
• Plus aggregated information in summary tables with hourly and daily granularity.
• The ATLAS job monitoring DB holds 380 GB for 1.5 years of data; daily growth over the last months is 1-5 GB/day.
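As a back-of-envelope check of what these numbers imply for the collectors, the average job rate can be derived directly from the daily counts (averages only; real traffic is bursty, and the assumed number of status updates per job is illustrative, not from the slides):

```python
# Rough average rates implied by ~1.1 million jobs per day
# (800k ATLAS + 300k CMS). Real traffic is bursty, so peaks are higher.

jobs_per_day = 800_000 + 300_000
seconds_per_day = 24 * 3600
avg_jobs_per_second = jobs_per_day / seconds_per_day  # ~12.7 jobs/s

# Each job produces several status-change updates over its lifetime;
# 5 updates/job is an assumed illustrative value.
updates_per_job = 5
avg_updates_per_second = avg_jobs_per_second * updates_per_job  # ~64 Hz
```

This is consistent in order of magnitude with the 50-150 Hz collector rates quoted on the following slides.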
DATA COLLECTION (CMS)
• For historical reasons, in the absence of a messaging system provided as a middleware component, CMS uses MonALISA as its messaging system.
• It works well. CMS currently uses 3 ML servers; more servers can be added in order to scale. ML can accept up to 5k messages per second; the bottleneck is rather the data recording to the DB, which is constantly monitored.
• Below are plots for one of the servers, the one used the most. One bar corresponds to 5 minutes (one collector loop). About 20k status update records are inserted every 5 minutes from a single server, i.e. 50-100 Hz per server.
• In case of any delay in the information update an alarm is sent; the alarm is triggered by an Oracle scheduled job.
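The staleness check behind that alarm can be sketched as a simple freshness test. In production it runs as an Oracle scheduled job against the insert timestamps; the plain-Python version below only illustrates the idea, and the 15-minute threshold is an assumed value.

```python
# Illustrative freshness check: raise an alarm when the most recent
# status insert is older than an allowed delay. In production this
# logic runs as an Oracle scheduled job; the threshold here is assumed.
from datetime import datetime, timedelta

def needs_alarm(last_update: datetime, now: datetime,
                max_delay: timedelta = timedelta(minutes=15)) -> bool:
    """Return True when the data flow appears stalled."""
    return now - last_update > max_delay

now = datetime(2011, 9, 1, 12, 0)
stale = needs_alarm(datetime(2011, 9, 1, 11, 30), now)  # 30 min old
fresh = needs_alarm(datetime(2011, 9, 1, 11, 55), now)  # 5 min old
```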
DATA COLLECTION (ATLAS)
• For ATLAS the main data flow comes from PanDA (100-150 Hz). A single server deals well with the load; the collector loop runs every 2 minutes.
• There were performance issues with the first collector implementation; a collector redesign solved the problem, mainly by replacing triggers with stored procedures called from the collector main thread. Thanks to the CERN DBAs for their support and suggestions.
• As for CMS, performance is constantly monitored and alarms are sent in case of any delay.
DATA ACCESS
• All UI applications run in parallel on two web servers (behind the same alias), but with no real load balancing. We would like to try to re-use what is used for the ActiveMQ message brokers.
• Access is monitored. It is steadily growing in terms of number of users, frequency of access, and volume of accessed data.
• AWStats for a single CMS server (metrics should be multiplied by two). Monthly access patterns: ~3-4k unique visitors (IP addresses), ~2-3M pages, ~300-400 GB bandwidth.
ATLAS DDM DASHBOARD
Monitoring ATLAS DDM Data Registrations and Transfers
[Diagram: DDM callback events flow from consumers into the database; agents generate statistics, which the server exposes through the web UI and API.]
CONSUMERS AND STATISTICS GENERATION
• 2 consumers (Apache) receive callback events from 11 DDM Site Service VO boxes (~50 Hz).
• Callback events stored in monthly partitioned database (Oracle) tables and kept for at least 3 months.
• Statistics generation agents (Dashboard Agent) run every 10 minutes generating statistics into 10 minute bins by source/destination/activity. (~ 50 k records per day)
• Statistics aggregation agents (Dashboard Agent) run every 10 minutes aggregating statistics into 24 hour bins by source/destination/activity. (~ 4 k records per day)
• Statistics stored in monthly partitioned database (Oracle) tables and kept indefinitely.
• The DB size is 1625 GB, the biggest of all Dashboard DBs; daily growth over the last months is 1-5 GB/day.
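The 10-minute statistics generation described above can be sketched as a binning step: each callback event is assigned to a time bin and aggregated by source/destination/activity. The field names below are illustrative, not the actual DDM schema.

```python
# Sketch of binning transfer callback events into 10-minute statistics,
# keyed by (bin start, source, destination, activity).
# Event field names are illustrative.
from collections import defaultdict

BIN_SECONDS = 600  # 10-minute bins

def bin_events(events):
    stats = defaultdict(lambda: {"files": 0, "bytes": 0})
    for ev in events:
        # Snap the event timestamp to the start of its 10-minute bin.
        bin_start = ev["timestamp"] - ev["timestamp"] % BIN_SECONDS
        key = (bin_start, ev["source"], ev["destination"], ev["activity"])
        stats[key]["files"] += 1
        stats[key]["bytes"] += ev["bytes"]
    return dict(stats)

events = [
    {"timestamp": 1000, "source": "CERN", "destination": "BNL",
     "activity": "production", "bytes": 100},
    {"timestamp": 1100, "source": "CERN", "destination": "BNL",
     "activity": "production", "bytes": 200},
]
stats = bin_events(events)
```

The 24-hour aggregation agent applies the same idea with a larger bin, which is why it produces roughly two orders of magnitude fewer records per day.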
WEB UI AND WEB API
• Statistics and event details are available via web UI and web API.
• The web API (Dashboard Web) provides CSV/XML/JSON formats.
• The web UI (Dashboard Web + AJAX/jQuery) provides highly flexible filtering for the statistics matrix and plots.
• Monthly access patterns: ~1k unique visitors, ~20M page hits, ~400 GB bandwidth; >90% of the traffic goes to the web API (50% from a single user).
PAST SCALE ISSUES
• Deadlocks in the DB due to too many connections from consumers. Solution: restrict the number of connection pools (Apache thread model) and the connection pool size (Dashboard).
• Some publishers monopolise a consumer due to diverse latency. Solution: an additional consumer for high-latency publishers.
• Statistics generation procedures ran too slowly. Solution: split the procedures to run in parallel and use bulk SQL (Oracle).
• Web UI and API queries for extended time periods were too slow. Solution: aggregate statistics into 24-hour bins in separate DB tables.
• Web server memory usage was too high. Solution: generate plots on the client (HighCharts).
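The "bulk SQL" fix in the list above can be illustrated with the DB-API `executemany` call, which sends a whole batch of rows in one statement instead of one INSERT per row. The sketch uses sqlite3 as a stand-in for the Oracle backend used in production, and the table layout is invented for the example.

```python
# Bulk insert via DB-API executemany(): one statement for many rows.
# sqlite3 stands in for Oracle here; the table is illustrative.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stats (src TEXT, dst TEXT, files INTEGER)")

rows = [("CERN", "BNL", 10), ("CERN", "FNAL", 7), ("RAL", "CERN", 3)]

# One call binds and inserts all rows, instead of a per-row round trip.
conn.executemany("INSERT INTO stats VALUES (?, ?, ?)", rows)
conn.commit()

count = conn.execute("SELECT COUNT(*) FROM stats").fetchone()[0]
```

With Oracle the same idea is typically expressed with array binds or PL/SQL FORALL, but the batching principle is identical.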
CURRENT SCALE ISSUES
• Occasional execution plan instabilities. Plan: investigate Oracle 11g SQL plan management to improve stability. Many thanks to the DBAs for support in fixing instabilities when they occur.
• High load on the web API from a few clients. Plan: work with users to develop a more efficient API that meets their requirements.
• Consumers are approaching their load limit. Plan: investigate message brokering (ActiveMQ) as a buffer to simplify bulk inserts.
SITE STATUS BOARD (SSB)
• Deployed for the 4 experiments.
• Gathers metrics for all entities. Metrics are defined by the experiment and can be created dynamically. Originally entity = site; now an entity could also be a 'channel'. A measurement = start/end time, value, color, site, URL. More than 370 metrics across all experiments.
• Metrics are gathered by collectors, with refresh rates between 10 minutes and 7 days.
• Presents the latest state and historical information, through different views (a view is a set of metrics); more than 40 views.
• Different Oracle databases per experiment. CMS: 87M entries, 20 collectors; ATLAS: 50M entries, 3 collectors.
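The SSB data model described above can be sketched as a small structure: one measurement per (entity, metric), and a view as a named subset of metrics. Names and sample values below are illustrative, not the real SSB schema.

```python
# Sketch of the SSB model: a Measurement per (entity, metric),
# and a "view" as the subset of measurements for a set of metrics.
# Field names and sample values are illustrative.
from dataclasses import dataclass

@dataclass
class Measurement:
    metric: str
    entity: str   # originally a site; now may also be a 'channel'
    start: str
    end: str
    value: str
    color: str    # traffic-light style status shown in the UI
    url: str      # link to the detailed source page

def view(measurements, metrics):
    """A view is simply the measurements belonging to a set of metrics."""
    return [m for m in measurements if m.metric in metrics]

data = [
    Measurement("JobEff", "T2_CH_CERN", "10:00", "11:00",
                "95", "green", "http://example.invalid/jobeff"),
    Measurement("BDII", "T2_CH_CERN", "10:00", "11:00",
                "ok", "green", "http://example.invalid/bdii"),
]
jobs_view = view(data, {"JobEff"})
```

Because views are just metric subsets, the >40 views can share the same stored measurements without duplication.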
SSB DATA FLOW JAN 2011
[Diagram: collectors (free text, BDII, job efficiency, topology, …) and Savannah write directly into the latest-results and historical-data stores.]
COLLECTOR IMPROVEMENTS
• Too many different writers (locking)! Use temporary files and a single writer.
• Huge table for historical values: partition by hash of metric and time.
• Insertion rate too slow (1 second/entry): avoid triggers and materialized views, and process as much as possible before insertion. The insertion rate is monitored; it is now 20 ms per entry.
• Thanks to the CERN DBAs for their support and suggestions.
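The first improvement above, funnelling many writers through temporary files into a single writer, can be sketched as follows. File layout and record format are invented for the example; in production the final step is a bulk load into Oracle rather than an in-memory list.

```python
# Sketch of the "many writers -> temp files -> single writer" pattern:
# each collector appends only to its own file, so collectors never
# contend for DB locks; one loader later drains all files in one pass.
import csv
import glob
import os
import tempfile

tmpdir = tempfile.mkdtemp()

def collector_write(name, rows):
    """Each collector writes only to its own temporary file."""
    with open(os.path.join(tmpdir, f"{name}.csv"), "a", newline="") as f:
        csv.writer(f).writerows(rows)

def single_writer_load():
    """The single writer drains every temporary file.

    In production this step would perform one bulk INSERT.
    """
    loaded = []
    for path in glob.glob(os.path.join(tmpdir, "*.csv")):
        with open(path, newline="") as f:
            loaded.extend(tuple(row) for row in csv.reader(f))
        os.remove(path)
    return loaded

collector_write("bdii", [("site1", "ok")])
collector_write("jobeff", [("site1", "95")])
rows = single_writer_load()
```

Serialising all DB writes through one process is what removed the locking contention between the 20+ collectors.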
SSB DATA FLOW SEP 2011
[Diagram: collectors (free text, BDII, job efficiency, topology, …) and Savannah now write to temporary files; a single "load data" step feeds the latest-results and historical-values stores.]
UI IMPROVEMENTS
• Better graphics
• Filtering, sorting and pagination
• Exporting data
• Client-side plotting
CURRENT CHALLENGES (SSB)
• Steadily growing amount of data. Aggregate? Decrease granularity for older values? NoSQL?
DASHBOARD DATA MINING (EXAMPLE)
• The exit code generated in case of job failure does not always allow identifying the cause of the problem.
• A data mining technique called association rule mining was applied to the collected job monitoring data in order to identify the causes of job failures.
• Within the Dashboard framework, the Quick Analysis Of Error Sources (QAOES) application was developed by a PhD student.
• Logically two steps: identifying a problem, then providing previously collected human expertise about possible solutions to the detected problem. The information is merged and exposed through the UI.
• The application ran for about a year for CMS. Active evaluation and contribution from the experiment were needed to make something really useful out of it; unfortunately this did not happen.
DASHBOARD CLUSTER
• 50 machines: 16 physical (Quattor, PES control; 4 of them SLC4), 28 virtual (Quattor, PES control), 2 temporary virtual (PES control), 4 virtual for internal tests only.
• Used by the IT-ES group (not only Dashboard).
• Quattor templates for common components (iptables, certificates, web servers, yum); every machine has some manual configuration.
• MonALISA monitoring: host monitor, web server, collectors, RPMs, alarms… automatic actions still to be configured.
• Wiki describing typical actions: https://twiki.cern.ch/twiki/bin/view/ArdaGrid/Dashboard#Dashboard_Machines_Overview
DASHBOARD CLUSTER MONITORING DISPLAY
CMS DATA POPULARITY
• The CMS Popularity project is a monitoring service for data access patterns.
• Technology:
  - CRAB/Dashboard collects the file-based information from CMS user jobs running on the grid
  - A Python daemon harvests this information and populates an Oracle database backend
  - Oracle materialized views perform the daily data aggregation
  - A web UI, developed using the Django web framework and jQuery, exposes the popularity metrics (historical views, aggregated views) as tables, plots, and a JSON API
• Scale:
  - Information collected on more than 300k files/day
  - Harvesting time needed by the daemon: ~40'/day
  - Refresh time of the materialized views: ~1'/day
Monitoring TF – IT/ES
CMS SITE CLEANING AGENT
• The CMS Site Cleaning Agent implements the strategies to free up space at T2s.
• Technology: a Python-based application; CMS Popularity and PhEDEx information is accessed via HTTP JSON APIs; results are exposed via the CMS Popularity web UI.
• Scale: runs once a day (processing time ~2h); monitors the disk space of O(50) CMS T2 sites and O(20) physics groups, looking for sites/groups over quota; O(200k) data blocks checked per run.
HAMMER CLOUD
• Technology:
  - HC is a "Ganga application": a Python service which uses Ganga to submit grid jobs to gLite WMS, PanDA, CRAB, and DIRAC backends.
  - State is recorded in a MySQL database; there is a plan to develop an Oracle backend.
  - HC provides a Django-based web frontend, developed with JSON/jQuery UI elements.
• Scale:
  - HC runs three instances, for ATLAS, CMS, and LHCb, with ~60 user accounts (mostly grid site administrators).
  - Total of ~10-20,000 jobs per day; each job is very short, just enough to test the basic grid analysis workflows.
  - History is kept for all test jobs; the DBs currently contain ~30 million job records.
USING SLS FOR ATLAS DISTRIBUTED COMPUTING (ADC)
• In ADC, 10 critical services are monitored; each service has between 1 and 10 service instances. Metrics to calculate availability are gathered using Lemon, Webalizer, and service-specific reports.
• In addition, ADC has an SLS-based T1 storage space monitoring (around 40 space tokens); storage space information is retrieved using lcg-utils. LHCb has a very similar implementation.
• The information is watched by ADC shifters, who are instructed to report immediately to the ATLAS Manager on Duty in case a service is degraded.
CORAL APPLICATION MONITORING
• Monitor DB connections, queries, and transactions performed by a CORAL client application. This feature is being fixed and enhanced (it has existed in CORAL for a long time but was never really used); CMS wants to use it with the Oracle and Frontier plugins. The CORAL code is internally instrumented to keep track of DB operations and dump them when the client job ends.
• The feature is also being integrated into the CORAL server. ATLAS wants to use it to monitor DB operations in the HLT: keep track of DB operations executed via the CORAL server and make them available in real time while the server is up. Eventually they would also like to monitor packet traffic through the hierarchy of CORAL server proxies caching HLT data.
FRONTIER AND SQUID MONITORING
• The activity is performed within ATLAS, with help from Frontier experts in CMS and in contact with the CORAL team in IT-ES.
• The aim is to provide service availability monitoring for the ATLAS distributed Frontier/Squid deployment:
  - Probing Squids via MRTG (shown per individual node on frontier.cern.ch): based on BDII, being moved to AGIS
  - Probing Frontier via ping (shown for the service only, on SLS)
  - Grepping Frontier server logs (AWStats): operational at some sites such as CERN and BNL, being deployed elsewhere