IT ES MONITORING APPLICATIONS (TECHNOLOGY, SCALE, ISSUES)
Julia Andreeva

Page 1:

IT ES MONITORING APPLICATIONS

(TECHNOLOGY, SCALE, ISSUES)
Julia Andreeva

Page 2:

DASHBOARD APPLICATIONS

[Diagram: Dashboard application areas and their target audiences]
• Monitoring of the job processing (analysis and production), real-time and historical views: users, operation teams, sites
• Data management monitoring (data transfer, data access): operation teams, sites
• Site Status Board / Site usability / SiteView: sites, operation teams
• Infrastructure monitoring: sites
• Publicity & dissemination (WLCG Google Earth Dashboard): sites, general public

Page 3:

COMMON SOLUTIONS

Application                             ATLAS  CMS  LHCb  ALICE
Job monitoring (multiple applications)    ✓     ✓
Site Status Board                         ✓     ✓    ✓     ✓
SUM                                       ✓     ✓    ✓     ✓
DDM Monitoring                            ✓
SiteView & Google Earth                   ✓     ✓    ✓     ✓

Page 4:

COMMON SOLUTIONS (same table as on Page 3)

A global WLCG transfer monitor based on the ATLAS DDM Dashboard is coming soon

Page 5:

COMMON SOLUTIONS (same table as on Page 3)

• All applications are shared by 2, 3 or 4 experiments
• All applications are developed in a common framework, which includes common building blocks, a build and test environment, a common module structure, agent management, and a common repository

Page 6:

DASHBOARD ARCHITECTURE

Page 7:

RECENT MODIFICATIONS

[Diagram: the Dashboard framework and external applications expose data in a machine-readable format (JSON), consumed by a client-side AJAX/JavaScript-based UI]

The UI is completely agnostic regarding the information source, which gives better flexibility: adding a new information source or replacing an existing one is a straightforward task, with clear decoupling of the development tasks. A sketch of the idea follows below.
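As an illustration only (not the actual Dashboard code), here is a minimal sketch of a server-side handler that returns the same JSON regardless of which backend produced the data; the function names and fields are hypothetical.

```python
import json

def fetch_job_summary(source="oracle"):
    """Hypothetical data-access layer: any backend (Oracle, files, another
    service) can sit behind this call as long as it returns plain records."""
    # In the real system this would query the Dashboard data repository.
    return [{"site": "T2_XX_Example", "status": "succeeded", "jobs": 1234},
            {"site": "T2_XX_Example", "status": "failed", "jobs": 56}]

def job_summary_json(source="oracle"):
    """Serve the data in a machine-readable format (JSON): the client-side
    AJAX/JavaScript UI depends only on this structure, not on the source."""
    return json.dumps({"summary": fetch_job_summary(source)})

if __name__ == "__main__":
    print(job_summary_json())
```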

Page 8:

USER INTERFACES

• Over the last months the Dashboard UIs have been redesigned
• Client-side Model-View-Controller architecture
• Using jQuery and AJAX
• Full bookmarking support
• A lot of effort was put into evaluating the design of large-scale JavaScript web applications and jQuery libraries. The experience is well documented, and recommendations for developers have been set up:
  https://twiki.cern.ch/twiki/bin/view/ArdaGrid/Libs
  http://code.google.com/p/hbrowse/w/list

A dedicated presentation could be of interest for the members of the monitoring group

Page 9:

JOB MONITORING

• Provides information about data processing in the scope of a given VO

• Mainly based on the instrumentation of the job submission frameworks and therefore works transparently across various middleware platforms (OSG, ARC, gLite), various submission methods (pilots, etc.) and various execution backends (Grid, local)

• Merges information about a given job from multiple information sources (a unique job identifier is a requirement); see the sketch at the end of this slide

• Job monitoring applications are shared by ATLAS and CMS
• The DB schema and user interfaces are shared: basically the same implementation, adapted for a particular experiment
• The information sources and transport mechanisms are different, so the collectors have correspondingly different implementations
• Keeps track of all processing details at the single-job level
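A minimal sketch of the merge step, assuming each source reports (job_id, timestamp, attributes) records; the unique job identifier is the merge key and later updates overwrite earlier ones. Names and record shapes are assumptions for illustration.

```python
def merge_job_updates(updates):
    """Merge updates about the same job coming from multiple sources
    (submission tool, worker node, etc.). The unique job identifier is the
    merge key; attributes from later timestamps overwrite earlier ones."""
    jobs = {}
    for job_id, timestamp, attrs in sorted(updates, key=lambda u: u[1]):
        jobs.setdefault(job_id, {}).update(attrs)
    return jobs

# Example: three sources report about the same job at different times.
merged = merge_job_updates([
    ("job_42", 1.0, {"status": "submitted", "site": "CERN-PROD"}),
    ("job_42", 2.0, {"status": "running", "worker_node": "wn123"}),
    ("job_42", 3.0, {"status": "finished", "exit_code": "0"}),
])
print(merged["job_42"]["status"])  # -> finished
```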

Page 10:


JOB MONITORING ARCHITECTURE

[Diagram: job monitoring architecture]
• Jobs running at the WNs and the job submission client or server publish information to a message server (MonALISA or MSG)
• A Dashboard consumer feeds the Dashboard Data Repository (Oracle), one per experiment; for ATLAS the PanDA DB and the Production DB are additional sources
• Data retrieval via APIs; user web interfaces are served by the Dashboard web server

Page 11:

JOB MONITORING DB

• Currently implemented in Oracle
• The schema is normalized
• The CMS schema is partitioned twice per month, the ATLAS schema weekly
• Some interfaces use pre-cooked aggregated data
• However, some of them use raw data. Although a lot of tuning was done recently to improve the performance of the applications that use raw data, there is still room for improvement. We foresee trying NoSQL solutions as a cache for the UI
• The main issue is occasional performance degradation due to instabilities of the execution plan. Hopefully the situation will improve with the migration to Oracle 11g

Page 12:

SOME NUMBERS

• ATLAS submits up to 800 K jobs per day, CMS up to 300 K jobs per day => about 1 million jobs per day to follow. Regular updates of job status changes are received (per job)

• Per job, the DB contains time stamps of job status changes, meta information about the job, job status, error codes and error reasons, job processing metrics (CPU, wallclock, memory consumption, etc.) and the list of accessed files (kept for a short time only)

• Plus aggregated information in summary tables with hourly and daily granularity (see the sketch below)

• Size of the ATLAS job monitoring DB: 380 GB for 1.5 years of data. Daily growth over the last months: 1-5 GB/day
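As a rough illustration of the pre-cooked summaries, the sketch below folds raw per-job status records into hourly counts per (site, status); the record layout is invented for the example and does not reflect the real schema.

```python
from collections import Counter
from datetime import datetime

def hourly_summary(raw_records):
    """raw_records: iterable of (timestamp, site, status) tuples taken from
    the per-job tables. Returns counts keyed by (hour, site, status), i.e.
    the kind of row a summary table with hourly granularity would hold."""
    counts = Counter()
    for ts, site, status in raw_records:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(hour, site, status)] += 1
    return counts

rows = [
    (datetime(2011, 9, 1, 10, 5), "T1_XX_Example", "finished"),
    (datetime(2011, 9, 1, 10, 42), "T1_XX_Example", "finished"),
    (datetime(2011, 9, 1, 11, 3), "T1_XX_Example", "failed"),
]
for key, n in sorted(hourly_summary(rows).items()):
    print(key, n)
```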

Page 13:

DATA COLLECTION (CMS)

• For historical reasons, in the absence of a messaging system provided as a middleware component, MonALISA is used as the messaging system for CMS

• It works well. Currently 3 ML servers are used for CMS; more servers can be added in order to scale. ML can accept up to 5 K messages per second. The bottleneck is rather at the level of data recording to the DB, which is constantly monitored

• Below are plots for one of the servers, the one used the most. One bar corresponds to 5 minutes (1 collector loop)

• ~20 K status update records are inserted every 5 minutes from a single server, i.e. 50-100 Hz per server. In case of any delay in the information update an alarm is sent; the alarm is triggered by an Oracle scheduled job (see the sketch below)
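The delay alarm itself is an Oracle scheduled job; a minimal Python equivalent of the check it performs could look like the sketch below. The threshold, addresses and the way the last insertion time is obtained are assumptions, not the production values.

```python
import smtplib
from datetime import datetime, timedelta
from email.mime.text import MIMEText

MAX_DELAY = timedelta(minutes=15)          # assumed threshold, not the real one

def collector_is_delayed(last_insert_time, now=None):
    """Return True if the latest status-update insertion is too old."""
    now = now or datetime.utcnow()
    return now - last_insert_time > MAX_DELAY

def send_alarm(delay):
    """Send a simple alarm mail; addresses and mail server are placeholders."""
    msg = MIMEText("Job monitoring collector delayed by %s" % delay)
    msg["Subject"] = "Dashboard collector delay alarm"
    msg["From"] = "dashboard@example.org"
    msg["To"] = "operators@example.org"
    smtplib.SMTP("localhost").sendmail(msg["From"], [msg["To"]], msg.as_string())

# In the real system the timestamp would come from the DB, e.g. MAX(insert_time).
last_insert = datetime.utcnow() - timedelta(minutes=30)
if collector_is_delayed(last_insert):
    send_alarm(datetime.utcnow() - last_insert)
```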

Page 14:

DATA COLLECTION (ATLAS)

• For ATLAS the main data flow comes from PanDA (100-150 Hz)
• A single server deals well with the load
• The collector loop runs every 2 minutes
• There were performance issues with the first collector implementation. A collector redesign solved the problem, mainly by replacing triggers with stored procedures which are called from the collector main thread (see the sketch below)
  Thanks to the CERN DBAs for their support and suggestions
• As for CMS, performance is constantly monitored and alarms are sent in case of any delay
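A hedged sketch of what calling a stored procedure from the collector main thread can look like with cx_Oracle; the procedure name, its parameters and the connection details are invented for illustration.

```python
import cx_Oracle  # Oracle client library, used here purely for illustration

def store_status_updates(dsn, user, password, updates):
    """Instead of relying on DB triggers, the collector explicitly calls a
    stored procedure for each batch of status updates it has parsed."""
    connection = cx_Oracle.connect(user, password, dsn)
    cursor = connection.cursor()
    try:
        for job_id, status, timestamp in updates:
            # Hypothetical procedure; the real schema and procedure differ.
            cursor.callproc("dashboard.update_job_status",
                            [job_id, status, timestamp])
        connection.commit()
    finally:
        cursor.close()
        connection.close()
```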

Page 15:

DATA ACCESS

• All UI applications run in parallel on two web servers (behind the same alias), but there is no real load balancing. We would like to try to re-use what is used for the ActiveMQ message brokers
• Access is monitored. It is steadily growing in terms of number of users, frequency of access and volume of accessed data
• Awstats for a single CMS server (metrics should be multiplied by two):
  Monthly access patterns: ~3-4 K unique visitors (IP addresses), ~2-3 M pages, ~300-400 GB bandwidth

Page 16:

ATLAS DDM DASHBOARD

Monitoring ATLAS DDM Data Registrations and Transfers

[Diagram: ATLAS DDM Dashboard architecture: consumers feed events into the database, server and agents turn them into statistics, exposed via the web UI & API]

Page 17:

CONSUMERS AND STATISTICS GENERATION

• 2 consumers (Apache) receive callback events from 11 DDM Site Service VO boxes (~50 Hz)

• Callback events are stored in monthly partitioned database (Oracle) tables and kept for at least 3 months

• Statistics generation agents (Dashboard Agent) run every 10 minutes, generating statistics in 10-minute bins by source/destination/activity (~50 K records per day); see the binning sketch below

• Statistics aggregation agents (Dashboard Agent) run every 10 minutes, aggregating statistics into 24-hour bins by source/destination/activity (~4 K records per day)

• Statistics are stored in monthly partitioned database (Oracle) tables and kept indefinitely

• The size of the DB is 1625 GB, the biggest of all Dashboard DBs. Daily growth over the last months: 1-5 GB/day
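A minimal sketch of the binning logic, under the assumption that each callback event carries a timestamp, source, destination, activity and byte count; the 24-hour aggregation reuses the same function with a wider bin.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TEN_MINUTES = timedelta(minutes=10)

def bin_start(ts, width=TEN_MINUTES):
    """Align a timestamp to the start of its bin (10 minutes by default)."""
    epoch = datetime(1970, 1, 1)
    return epoch + width * ((ts - epoch) // width)

def generate_statistics(events, width=TEN_MINUTES):
    """Fold raw callback events into per-bin statistics keyed by
    (bin start, source, destination, activity). Each event is assumed to be
    a (timestamp, source, destination, activity, bytes) tuple."""
    stats = defaultdict(lambda: {"transfers": 0, "bytes": 0})
    for ts, src, dst, activity, nbytes in events:
        key = (bin_start(ts, width), src, dst, activity)
        stats[key]["transfers"] += 1
        stats[key]["bytes"] += nbytes
    return stats

def aggregate_daily(events):
    """The 24-hour aggregation re-uses the same logic with a 24-hour bin."""
    return generate_statistics(events, width=timedelta(hours=24))
```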

Page 18:

WEB UI AND WEB API

• Statistics and event details are available via a web UI and a web API
• The web API (Dashboard Web) provides CSV/XML/JSON formats (see the sketch below)
• The web UI (Dashboard Web + AJAX/jQuery) provides highly flexible filtering for the statistics matrix and plots
• Monthly access patterns:
  ~1 K unique visitors, ~20 M page hits, ~400 GB bandwidth, >90 % of the traffic to the web API (50 % from a single user)
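How a client might use such a web API to pull transfer statistics as JSON, shown as a sketch only: the URL and query parameters are hypothetical, only the idea of selecting the output format through the API comes from the slide.

```python
import json
import urllib.request

# Hypothetical endpoint and parameters, for illustration only.
API_URL = "https://dashboard.example.cern.ch/ddm/api/matrix?format=json&hours=24"

def fetch_statistics(url=API_URL):
    """Fetch statistics from the web API and decode the JSON payload."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```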

Page 19:

PAST SCALE ISSUES

• Deadlocks in the DB due to too many connections from consumers
  Solution: restrict the number of connection pools (Apache thread model) and the connection pool size (Dashboard)

• Some publishers monopolise a consumer due to diverse latency
  Solution: an additional consumer for high-latency publishers

• Statistics generation procedures ran too slowly
  Solution: split the procedures to run in parallel and use bulk SQL (Oracle); see the sketch below

• Web UI and API queries for extended time periods were too slow
  Solution: aggregate statistics into 24-hour bins in separate DB tables

• Web server memory usage was too high
  Solution: generate plots on the client (HighCharts)
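For the bulk SQL point, a minimal cx_Oracle sketch inserting many statistics rows in one round trip with executemany instead of row-by-row INSERTs; the table and column names are invented.

```python
import cx_Oracle

def bulk_insert_statistics(connection, rows):
    """Insert many statistics rows in a single round trip.
    rows: list of (bin_start, source, destination, activity, transfers, bytes).
    The table and column names are placeholders, not the real schema."""
    cursor = connection.cursor()
    cursor.executemany(
        "INSERT INTO ddm_stats_10min "
        "(bin_start, source, destination, activity, transfers, bytes) "
        "VALUES (:1, :2, :3, :4, :5, :6)",
        rows,
    )
    connection.commit()
    cursor.close()
```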

Page 20:

CURRENT SCALE ISSUES

• Occasional execution plan instabilities
  Plan: investigate Oracle 11g SQL plan management to improve stability
  Many thanks to the DBAs for their support in fixing instabilities when they occur

• High load on the web API from a few clients
  Plan: work with users to develop a more efficient API that meets their requirements

• Consumers are approaching their load limit
  Plan: investigate message brokering (ActiveMQ) as a buffer to simplify bulk inserts (see the sketch below)
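The buffering idea can be sketched independently of the broker API: messages (which in practice would arrive from ActiveMQ, e.g. via STOMP) are accumulated and flushed to the database in batches. The class below is illustrative only.

```python
import queue

class BufferedInserter:
    """Accumulate incoming callback events and flush them in batches, the
    pattern a message-broker buffer would enable for bulk inserts."""

    def __init__(self, flush_callback, batch_size=500):
        self._events = queue.Queue()
        self._flush = flush_callback     # e.g. the bulk_insert_statistics sketch
        self._batch_size = batch_size

    def on_message(self, event):
        """Called for every message delivered by the broker."""
        self._events.put(event)
        if self._events.qsize() >= self._batch_size:
            self.flush()

    def flush(self):
        """Drain up to one batch of events and hand it to the bulk insert."""
        batch = []
        while not self._events.empty() and len(batch) < self._batch_size:
            batch.append(self._events.get())
        if batch:
            self._flush(batch)
```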

Page 21:

SITE STATUS BOARD (SSB)

• Deployed for the 4 experiments
• Gathers metrics for all entities
  Metrics are defined by the experiment and can be created dynamically
  Originally entity = site; now an entity can also be a 'channel'
  Measurement = start/end time, value, color, site, URL (see the sketch below)
  More than 370 metrics across all experiments
• Metrics are gathered by collectors
  Refresh rate between 10 minutes and 7 days
• Presents the latest state and historical information
• Presents different views (a view is a set of metrics)
  More than 40 views
• Different Oracle databases per experiment
  CMS: 87 M entries, 20 collectors; ATLAS: 50 M entries, 3 collectors
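A sketch of the measurement record listed above and of a toy collector filling one; the class is illustrative and does not mirror the SSB schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Measurement:
    """One SSB measurement, following the fields listed on the slide."""
    site: str
    metric: str
    start_time: datetime
    end_time: datetime
    value: str
    color: str       # e.g. "green" or "red", as interpreted by the views
    url: str         # link to the detailed source of the value

def collect_job_efficiency(site, efficiency, details_url):
    """Toy collector producing one measurement for a 'job efficiency' metric."""
    now = datetime.utcnow()
    color = "green" if efficiency > 0.9 else "red"
    return Measurement(site, "job_efficiency", now, now,
                       "%.2f" % efficiency, color, details_url)
```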

Page 22:

SSB DATA FLOW JAN 2011

[Diagram: SSB collectors (Savannah, free text, BDII, job efficiency, topology) write directly to the latest-results and historical-data tables]

Page 23:

COLLECTOR IMPROVEMENTS

• Too many different writers (LOCKING)!
  Use temporary files and a single writer (see the sketch below)

• Huge table for historical values
  Partition by hash of metric and time

• Insertion rate too slow (1 second/entry)
  Avoid triggers and materialized views, and process as much as possible before insertion
  Monitor the insertion rate
  Now 20 ms per entry

• Thanks to the CERN DBAs for their support and suggestions
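A minimal sketch of the temporary-files-plus-single-writer pattern: each collector appends measurements to its own file, and one loader process reads the files and performs the DB insertions, so only a single session writes to the table. Paths and record formats are assumptions.

```python
import csv
import glob
import os

TMP_DIR = "/tmp/ssb_collectors"          # assumed location for the sketch

def write_measurements(collector_name, measurements):
    """Each collector only appends to its own temporary file (no DB locks)."""
    path = os.path.join(TMP_DIR, "%s.csv" % collector_name)
    with open(path, "a", newline="") as handle:
        csv.writer(handle).writerows(measurements)

def load_all(insert_rows):
    """The single writer: rename each temporary file, read it, insert its rows
    in one bulk operation and remove it."""
    for path in glob.glob(os.path.join(TMP_DIR, "*.csv")):
        in_progress = path + ".loading"
        os.rename(path, in_progress)      # collectors keep writing to a new file
        with open(in_progress, newline="") as handle:
            rows = list(csv.reader(handle))
        if rows:
            insert_rows(rows)             # e.g. a bulk executemany insert
        os.remove(in_progress)
```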

Page 24:

SSB DATA FLOW SEP 2011

[Diagram: SSB collectors (Savannah, free text, BDII, job efficiency, topology) write to temporary files; a single loader loads the data into the latest-results and historical-values tables]

Page 25:

UI IMPROVEMENTS

• Better graphics
• Filtering, sorting and pagination
• Exporting data
• Client-side plotting

Page 26:

CURRENT CHALLENGES (SSB)

• Steadily growing amount of data
  Aggregate? Decrease granularity for older values? NoSQL?

Page 27:

DASHBOARD DATA MINING (EXAMPLE)

• The exit code generated in case of job failure does not always allow identifying the cause of the problem

• A data mining technique called association rule mining was applied to the collected job monitoring data in order to identify the cause of job failures (see the sketch below)

• Within the Dashboard framework, the Quick Analysis Of Error Sources (QAOES) application was developed by a PhD student

• Logically two steps: identifying a problem and then providing previously collected human expertise about possible solutions to the detected problem. The information is merged and exposed through the UI

• The application ran for a year or so for CMS. It needed active evaluation and contributions from the experiment in order to make something really useful out of it; unfortunately, that did not happen
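To make the technique concrete, here is a toy sketch of association rule mining over job failure records: count how often combinations of attribute values (site, exit code, ...) co-occur with failures and keep the rules with sufficient support and confidence. This shows only the flavour of the approach, not the QAOES implementation.

```python
from collections import Counter
from itertools import combinations

def mine_failure_rules(jobs, min_support=2, min_confidence=0.8):
    """jobs: list of dicts with job attributes plus a boolean 'failed' flag.
    Returns (attribute pair, support, confidence) for rules 'pair => failure'."""
    pair_counts = Counter()
    fail_counts = Counter()
    for job in jobs:
        items = sorted((k, v) for k, v in job.items() if k != "failed")
        for pair in combinations(items, 2):
            pair_counts[pair] += 1
            if job["failed"]:
                fail_counts[pair] += 1
    rules = []
    for pair, total in pair_counts.items():
        confidence = fail_counts[pair] / float(total)
        if fail_counts[pair] >= min_support and confidence >= min_confidence:
            rules.append((pair, fail_counts[pair], confidence))
    return sorted(rules, key=lambda rule: -rule[2])

jobs = [
    {"site": "T2_A", "exit_code": "8020", "failed": True},
    {"site": "T2_A", "exit_code": "8020", "failed": True},
    {"site": "T2_B", "exit_code": "0", "failed": False},
]
for pair, support, confidence in mine_failure_rules(jobs):
    print(pair, support, confidence)
```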

Page 28:

DASHBOARD CLUSTER

• 50 machines
  16 physical, Quattor, PES control (4 SLC4)
  28 virtual, Quattor, PES control
  2 temporary virtual, PES control
  4 virtual, only for internal tests

• Used by the IT-ES group (not only Dashboard)
• Quattor templates for common components (iptables, certificates, web servers, yum)
  Every machine has some manual configuration

• MonALISA monitoring
  Host monitor, web server, collectors, rpm, alarms…
  Still to configure automatic actions

• Wiki describing typical actions:
  https://twiki.cern.ch/twiki/bin/view/ArdaGrid/Dashboard#Dashboard_Machines_Overview

Page 29:

DASHBOARD CLUSTER MONITORING DISPLAY

Page 30:

CMS DATA POPULARITY

• The CMS Popularity project is a monitoring service for the data access patterns
• Technology
  CRAB/Dashboard to collect the file-based information from the CMS user jobs running on the grid
  A Python daemon to harvest the above information and populate an Oracle database backend
  Oracle materialized views for daily data aggregation (see the sketch below)
  A web UI, developed using the Django web framework and jQuery, exposes the popularity metrics (historical views, aggregated views) as tables, plots and a JSON API
• Scale
  Collected information on more than 300 K files/day
  Harvesting time needed by the daemon: ~40'/day
  Refresh time of the materialized views: ~1'/day
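A toy sketch of the daily aggregation (done in production with Oracle materialized views): per-file access counts harvested from the job information are summed per dataset and day. The record layout is an assumption.

```python
from collections import defaultdict

def daily_popularity(file_accesses):
    """file_accesses: iterable of (date, dataset, filename, n_accesses) records
    harvested from the job monitoring information. Returns access counts per
    (date, dataset), the granularity the popularity views expose."""
    totals = defaultdict(int)
    for day, dataset, _filename, n_accesses in file_accesses:
        totals[(day, dataset)] += n_accesses
    return totals
```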


Page 31:

CMS SITE CLEANING AGENT

• CMS Site Cleaning Agent: implements the strategies to free up space at T2s

• Technology
  Python-based application
  CMS Popularity and PhEDEx information accessed via HTTP JSON APIs
  Results exposed via the CMS Popularity web UI

• Scale
  Runs once a day: processing time ~2 h
  Monitors the disk space of O(50) CMS T2 sites and O(20) physics groups, looking for sites/groups over quota (see the sketch below)
  O(200 K) data blocks checked per run
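A rough sketch of the over-quota check such an agent performs: compare used space per site/group against its quota and pick the least popular blocks as cleaning candidates. The data structures are invented for the example and are not the agent's actual interfaces.

```python
def cleaning_candidates(usage, quotas, block_popularity):
    """usage: {(site, group): used_bytes}; quotas: {(site, group): quota_bytes};
    block_popularity: {(site, group): [(block, accesses, size_bytes), ...]}.
    For each over-quota (site, group), return the least accessed blocks whose
    removal would bring the usage back under quota."""
    plan = {}
    for key, used in usage.items():
        quota = quotas.get(key)
        if quota is None or used <= quota:
            continue
        to_free = used - quota
        freed, selected = 0, []
        # Least popular blocks first.
        for block, accesses, size in sorted(block_popularity.get(key, []),
                                            key=lambda b: b[1]):
            if freed >= to_free:
                break
            selected.append(block)
            freed += size
        plan[key] = selected
    return plan
```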


Page 32:

HAMMER CLOUD

Technology:
- HC is a "Ganga application": a Python service which uses Ganga to submit grid jobs to gLite WMS, PanDA, CRAB, and DIRAC backends.
- State is recorded in a MySQL database; there are plans to develop an Oracle backend.
- HC provides a Django-based web frontend, developed with JSON/jQuery UI elements.

Scale:
- HC runs three instances, for ATLAS, CMS, and LHCb, with ~60 user accounts (mostly grid site administrators).
- Total of ~10-20,000 jobs per day. Each job is very short -- just to test the basic grid analysis workflows.
- History is kept for all test jobs -- currently the DBs contain records for ~30 million jobs.

Page 33:

USING SLS FOR ATLAS DISTRIBUTED COMPUTING (ADC)

• In ADC, 10 critical services are monitored. Each service has between 1 and 10 service instances. Metrics to calculate availability are gathered using Lemon, Webalizer and service-specific reports.

• In addition, ADC has an SLS-based T1 storage space monitoring (around 40 space tokens). Storage space information is retrieved using lcg-utils. LHCb has a very similar implementation.

• The information is monitored by ADC shifters, who are instructed to report immediately to the ATLAS Manager on Duty in case a service is degraded.

Page 34:

CORAL APPLICATION MONITORING

• Monitor DB connections, queries and transactions performed by a CORAL client application
  Fixing and enhancing this feature (it has existed in CORAL for a long time but was never really used)
  CMS wants to use it with the Oracle and Frontier plugins
  CORAL code is internally instrumented to keep track of DB operations and dump them when the client job ends

• Also integrating this feature in the CORAL server
  ATLAS wants to use it to monitor DB operations in the HLT
  Keep track of DB operations executed via the CORAL server and make them available in real time while the server is up
  Eventually we would also like to monitor packet traffic through the hierarchy of CORAL server proxies caching HLT data


Page 35:

FRONTIER AND SQUID MONITORING

• The activity is performed within ATLAS
  With help from Frontier experts in CMS and in contact with the CORAL team in IT-ES

• The aim is to provide service availability monitoring for the ATLAS distributed Frontier/Squid deployment
  Probing Squids via MRTG (shown per individual node on frontier.cern.ch) – based on BDII, being moved to AGIS
  Probing Frontier via ping (shown for the service only on SLS)
  Grepping Frontier server logs (AWSTATS) – operational at some sites like CERN and BNL, being deployed elsewhere
