IT ES MONITORING APPLICATIONS (TECHNOLOGY, SCALE, ISSUES)
Julia Andreeva

Page 1:

IT ES MONITORING APPLICATIONS

(TECHNOLOGY, SCALE, ISSUES)
Julia Andreeva

Page 2:

DASHBOARD APPLICATIONS

[Diagram: Dashboard application areas and their target audiences]
• Monitoring of the job processing (analysis and production), real-time and historical views: users, operation teams, sites
• Data management monitoring (data transfer, data access): operation teams, sites
• Site Status Board / Site usability / SiteView: sites, operation teams
• Infrastructure monitoring: sites
• Publicity & dissemination (WLCG Google Earth Dashboard): sites, general public

Page 3:

COMMON SOLUTIONS

Application                             ATLAS  CMS  LHCb  ALICE
Job monitoring (multiple applications)    ✓     ✓
Site Status Board                         ✓     ✓    ✓     ✓
SUM                                       ✓     ✓    ✓     ✓
DDM Monitoring                            ✓
SiteView & Google Earth                   ✓     ✓    ✓     ✓

Page 4:

COMMON SOLUTIONS (same table as on Page 3)

A global WLCG transfer monitor based on the ATLAS DDM Dashboard is coming soon

Page 5:

COMMON SOLUTIONS (same table as on Page 3)

• All applications are shared by 2, 3 or 4 experiments
• All applications are developed in a common framework, which includes common building blocks, a build and test environment, a common module structure, agent management, and a common repository

Page 6:

DASHBOARD ARCHITECTURE

Page 7:

RECENT MODIFICATIONS

[Diagram: the Dashboard framework and external applications expose data in a machine-readable format (JSON), consumed by a client-side AJAX/JavaScript-based UI]

The UI is completely agnostic regarding the information source, which gives better flexibility: adding a new information source or replacing an existing one is a straightforward task, with clear decoupling of the development tasks. A sketch of the idea follows below.
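As an illustration only (not the actual Dashboard code), here is a minimal sketch of a server-side handler that returns the same JSON regardless of which backend produced the data; the function names and fields are hypothetical.

```python
import json

def fetch_job_summary(source="oracle"):
    """Hypothetical data-access layer: any backend (Oracle, files, another
    service) can sit behind this call as long as it returns plain records."""
    # In the real system this would query the Dashboard data repository.
    return [{"site": "T2_XX_Example", "status": "succeeded", "jobs": 1234},
            {"site": "T2_XX_Example", "status": "failed", "jobs": 56}]

def job_summary_json(source="oracle"):
    """Serve the data in a machine-readable format (JSON): the client-side
    AJAX/JavaScript UI depends only on this structure, not on the source."""
    return json.dumps({"summary": fetch_job_summary(source)})

if __name__ == "__main__":
    print(job_summary_json())
```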

Page 8:

USER INTERFACES

• Over the last months the Dashboard UIs have been redesigned
• Client-side Model-View-Controller architecture
• Using jQuery and AJAX
• Full bookmarking support
• A lot of effort was put into evaluating the design of large-scale JavaScript web applications and jQuery libraries. The experience is well documented, and recommendations for developers have been set up:
  https://twiki.cern.ch/twiki/bin/view/ArdaGrid/Libs
  http://code.google.com/p/hbrowse/w/list

A dedicated presentation could be of interest for the members of the monitoring group

Page 9:

JOB MONITORING

• Provides information about data processing in the scope of a given VO

• Mainly based on the instrumentation of the job submission frameworks and therefore works transparently across various middleware platforms (OSG, ARC, gLite), various submission methods (pilots, etc.) and various execution backends (Grid, local)

• Merges information about a given job from multiple information sources (a unique job identifier is a requirement); see the sketch at the end of this slide

• Job monitoring applications are shared by ATLAS and CMS
• The DB schema and user interfaces are shared: basically the same implementation, adapted for a particular experiment
• The information sources and transport mechanisms are different, so the collectors have correspondingly different implementations
• Keeps track of all processing details at the single-job level
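A minimal sketch of the merge step, assuming each source reports (job_id, timestamp, attributes) records; the unique job identifier is the merge key and later updates overwrite earlier ones. Names and record shapes are assumptions for illustration.

```python
def merge_job_updates(updates):
    """Merge updates about the same job coming from multiple sources
    (submission tool, worker node, etc.). The unique job identifier is the
    merge key; attributes from later timestamps overwrite earlier ones."""
    jobs = {}
    for job_id, timestamp, attrs in sorted(updates, key=lambda u: u[1]):
        jobs.setdefault(job_id, {}).update(attrs)
    return jobs

# Example: three sources report about the same job at different times.
merged = merge_job_updates([
    ("job_42", 1.0, {"status": "submitted", "site": "CERN-PROD"}),
    ("job_42", 2.0, {"status": "running", "worker_node": "wn123"}),
    ("job_42", 3.0, {"status": "finished", "exit_code": "0"}),
])
print(merged["job_42"]["status"])  # -> finished
```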

Page 10:


JOB MONITORING ARCHITECTURE

[Diagram: job monitoring architecture]
• Jobs running at the WNs and the job submission client or server publish information to a message server (MonALISA or MSG)
• A Dashboard consumer feeds the Dashboard Data Repository (Oracle), one per experiment; for ATLAS the PanDA DB and the Production DB are additional sources
• Data retrieval via APIs; user web interfaces are served by the Dashboard web server

Page 11:

JOB MONITORING DB

• Currently implemented in Oracle
• The schema is normalized
• The CMS schema is partitioned twice per month, the ATLAS schema weekly
• Some interfaces use pre-cooked aggregated data
• However, some of them use raw data. Although a lot of tuning was done recently to improve the performance of the applications that use raw data, there is still room for improvement. We foresee trying NoSQL solutions as a cache for the UI
• The main issue is occasional performance degradation due to instabilities of the execution plan. Hopefully the situation will improve with the migration to Oracle 11g

Page 12:

SOME NUMBERS

• ATLAS submits up to 800 K jobs per day, CMS up to 300 K jobs per day => about 1 million jobs per day to follow. Regular updates of job status changes are received (per job)

• Per job, the DB contains time stamps of job status changes, meta information about the job, job status, error codes and error reasons, job processing metrics (CPU, wallclock, memory consumption, etc.) and the list of accessed files (kept for a short time only)

• Plus aggregated information in summary tables with hourly and daily granularity (see the sketch below)

• Size of the ATLAS job monitoring DB: 380 GB for 1.5 years of data. Daily growth over the last months: 1-5 GB/day
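As a rough illustration of the pre-cooked summaries, the sketch below folds raw per-job status records into hourly counts per (site, status); the record layout is invented for the example and does not reflect the real schema.

```python
from collections import Counter
from datetime import datetime

def hourly_summary(raw_records):
    """raw_records: iterable of (timestamp, site, status) tuples taken from
    the per-job tables. Returns counts keyed by (hour, site, status), i.e.
    the kind of row a summary table with hourly granularity would hold."""
    counts = Counter()
    for ts, site, status in raw_records:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        counts[(hour, site, status)] += 1
    return counts

rows = [
    (datetime(2011, 9, 1, 10, 5), "T1_XX_Example", "finished"),
    (datetime(2011, 9, 1, 10, 42), "T1_XX_Example", "finished"),
    (datetime(2011, 9, 1, 11, 3), "T1_XX_Example", "failed"),
]
for key, n in sorted(hourly_summary(rows).items()):
    print(key, n)
```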

Page 13:

DATA COLLECTION (CMS)

• For historical reasons, in the absence of a messaging system provided as a middleware component, MonALISA is used as the messaging system for CMS

• It works well. Currently 3 ML servers are used for CMS; more servers can be added in order to scale. ML can accept up to 5 K messages per second. The bottleneck is rather at the level of data recording to the DB, which is constantly monitored

• Below are plots for one of the servers, the one used the most. One bar corresponds to 5 minutes (1 collector loop)

• ~20 K status update records are inserted every 5 minutes from a single server, i.e. 50-100 Hz per server. In case of any delay in the information update an alarm is sent; the alarm is triggered by an Oracle scheduled job (see the sketch below)
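The delay alarm itself is an Oracle scheduled job; a minimal Python equivalent of the check it performs could look like the sketch below. The threshold, addresses and the way the last insertion time is obtained are assumptions, not the production values.

```python
import smtplib
from datetime import datetime, timedelta
from email.mime.text import MIMEText

MAX_DELAY = timedelta(minutes=15)          # assumed threshold, not the real one

def collector_is_delayed(last_insert_time, now=None):
    """Return True if the latest status-update insertion is too old."""
    now = now or datetime.utcnow()
    return now - last_insert_time > MAX_DELAY

def send_alarm(delay):
    """Send a simple alarm mail; addresses and mail server are placeholders."""
    msg = MIMEText("Job monitoring collector delayed by %s" % delay)
    msg["Subject"] = "Dashboard collector delay alarm"
    msg["From"] = "dashboard@example.org"
    msg["To"] = "operators@example.org"
    smtplib.SMTP("localhost").sendmail(msg["From"], [msg["To"]], msg.as_string())

# In the real system the timestamp would come from the DB, e.g. MAX(insert_time).
last_insert = datetime.utcnow() - timedelta(minutes=30)
if collector_is_delayed(last_insert):
    send_alarm(datetime.utcnow() - last_insert)
```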

Page 14:

DATA COLLECTION (ATLAS)

• For ATLAS the main data flow comes from PanDA (100-150 Hz)
• A single server deals well with the load
• The collector loop runs every 2 minutes
• There were performance issues with the first collector implementation. A collector redesign solved the problem, mainly by replacing triggers with stored procedures which are called from the collector main thread (see the sketch below)
  Thanks to the CERN DBAs for their support and suggestions
• As for CMS, performance is constantly monitored and alarms are sent in case of any delay
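A hedged sketch of what calling a stored procedure from the collector main thread can look like with cx_Oracle; the procedure name, its parameters and the connection details are invented for illustration.

```python
import cx_Oracle  # Oracle client library, used here purely for illustration

def store_status_updates(dsn, user, password, updates):
    """Instead of relying on DB triggers, the collector explicitly calls a
    stored procedure for each batch of status updates it has parsed."""
    connection = cx_Oracle.connect(user, password, dsn)
    cursor = connection.cursor()
    try:
        for job_id, status, timestamp in updates:
            # Hypothetical procedure; the real schema and procedure differ.
            cursor.callproc("dashboard.update_job_status",
                            [job_id, status, timestamp])
        connection.commit()
    finally:
        cursor.close()
        connection.close()
```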

Page 15:

DATA ACCESS

• All UI applications run in parallel on two web servers (behind the same alias), but there is no real load balancing. We would like to try to re-use what is used for the ActiveMQ message brokers
• Access is monitored. It is steadily growing in terms of number of users, frequency of access and volume of accessed data
• Awstats for a single CMS server (metrics should be multiplied by two):
  Monthly access patterns: ~3-4 K unique visitors (IP addresses), ~2-3 M pages, ~300-400 GB bandwidth

Page 16:

ATLAS DDM DASHBOARD

Monitoring ATLAS DDM Data Registrations and Transfers

[Diagram: ATLAS DDM Dashboard architecture: consumers feed events into the database, server and agents turn them into statistics, exposed via the web UI & API]

Page 17:

CONSUMERS AND STATISTICS GENERATION

• 2 consumers (Apache) receive callback events from 11 DDM Site Service VO boxes (~50 Hz)

• Callback events are stored in monthly partitioned database (Oracle) tables and kept for at least 3 months

• Statistics generation agents (Dashboard Agent) run every 10 minutes, generating statistics in 10-minute bins by source/destination/activity (~50 K records per day); see the binning sketch below

• Statistics aggregation agents (Dashboard Agent) run every 10 minutes, aggregating statistics into 24-hour bins by source/destination/activity (~4 K records per day)

• Statistics are stored in monthly partitioned database (Oracle) tables and kept indefinitely

• The size of the DB is 1625 GB, the biggest of all Dashboard DBs. Daily growth over the last months: 1-5 GB/day
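A minimal sketch of the binning logic, under the assumption that each callback event carries a timestamp, source, destination, activity and byte count; the 24-hour aggregation reuses the same function with a wider bin.

```python
from collections import defaultdict
from datetime import datetime, timedelta

TEN_MINUTES = timedelta(minutes=10)

def bin_start(ts, width=TEN_MINUTES):
    """Align a timestamp to the start of its bin (10 minutes by default)."""
    epoch = datetime(1970, 1, 1)
    return epoch + width * ((ts - epoch) // width)

def generate_statistics(events, width=TEN_MINUTES):
    """Fold raw callback events into per-bin statistics keyed by
    (bin start, source, destination, activity). Each event is assumed to be
    a (timestamp, source, destination, activity, bytes) tuple."""
    stats = defaultdict(lambda: {"transfers": 0, "bytes": 0})
    for ts, src, dst, activity, nbytes in events:
        key = (bin_start(ts, width), src, dst, activity)
        stats[key]["transfers"] += 1
        stats[key]["bytes"] += nbytes
    return stats

def aggregate_daily(events):
    """The 24-hour aggregation re-uses the same logic with a 24-hour bin."""
    return generate_statistics(events, width=timedelta(hours=24))
```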

Page 18:

WEB UI AND WEB API

• Statistics and event details are available via a web UI and a web API
• The web API (Dashboard Web) provides CSV/XML/JSON formats (see the sketch below)
• The web UI (Dashboard Web + AJAX/jQuery) provides highly flexible filtering for the statistics matrix and plots
• Monthly access patterns:
  ~1 K unique visitors, ~20 M page hits, ~400 GB bandwidth, >90 % of the traffic to the web API (50 % from a single user)
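How a client might use such a web API to pull transfer statistics as JSON, shown as a sketch only: the URL and query parameters are hypothetical, only the idea of selecting the output format through the API comes from the slide.

```python
import json
import urllib.request

# Hypothetical endpoint and parameters, for illustration only.
API_URL = "https://dashboard.example.cern.ch/ddm/api/matrix?format=json&hours=24"

def fetch_statistics(url=API_URL):
    """Fetch statistics from the web API and decode the JSON payload."""
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```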

Page 19:

PAST SCALE ISSUES

• Deadlocks in the DB due to too many connections from consumers
  Solution: restrict the number of connection pools (Apache thread model) and the connection pool size (Dashboard)

• Some publishers monopolise a consumer due to diverse latency
  Solution: an additional consumer for high-latency publishers

• Statistics generation procedures ran too slowly
  Solution: split the procedures to run in parallel and use bulk SQL (Oracle); see the sketch below

• Web UI and API queries for extended time periods were too slow
  Solution: aggregate statistics into 24-hour bins in separate DB tables

• Web server memory usage was too high
  Solution: generate plots on the client (HighCharts)
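For the bulk SQL point, a minimal cx_Oracle sketch inserting many statistics rows in one round trip with executemany instead of row-by-row INSERTs; the table and column names are invented.

```python
import cx_Oracle

def bulk_insert_statistics(connection, rows):
    """Insert many statistics rows in a single round trip.
    rows: list of (bin_start, source, destination, activity, transfers, bytes).
    The table and column names are placeholders, not the real schema."""
    cursor = connection.cursor()
    cursor.executemany(
        "INSERT INTO ddm_stats_10min "
        "(bin_start, source, destination, activity, transfers, bytes) "
        "VALUES (:1, :2, :3, :4, :5, :6)",
        rows,
    )
    connection.commit()
    cursor.close()
```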

Page 20:

CURRENT SCALE ISSUES

• Occasional execution plan instabilities
  Plan: investigate Oracle 11g SQL plan management to improve stability
  Many thanks to the DBAs for their support in fixing instabilities when they occur

• High load on the web API from a few clients
  Plan: work with users to develop a more efficient API that meets their requirements

• Consumers are approaching their load limit
  Plan: investigate message brokering (ActiveMQ) as a buffer to simplify bulk inserts (see the sketch below)
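The buffering idea can be sketched independently of the broker API: messages (which in practice would arrive from ActiveMQ, e.g. via STOMP) are accumulated and flushed to the database in batches. The class below is illustrative only.

```python
import queue

class BufferedInserter:
    """Accumulate incoming callback events and flush them in batches, the
    pattern a message-broker buffer would enable for bulk inserts."""

    def __init__(self, flush_callback, batch_size=500):
        self._events = queue.Queue()
        self._flush = flush_callback     # e.g. the bulk_insert_statistics sketch
        self._batch_size = batch_size

    def on_message(self, event):
        """Called for every message delivered by the broker."""
        self._events.put(event)
        if self._events.qsize() >= self._batch_size:
            self.flush()

    def flush(self):
        """Drain up to one batch of events and hand it to the bulk insert."""
        batch = []
        while not self._events.empty() and len(batch) < self._batch_size:
            batch.append(self._events.get())
        if batch:
            self._flush(batch)
```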

Page 21:

SITE STATUS BOARD (SSB)

• Deployed for the 4 experiments
• Gathers metrics for all entities
  Metrics are defined by the experiment and can be created dynamically
  Originally entity = site; now an entity can also be a 'channel'
  Measurement = start/end time, value, color, site, URL (see the sketch below)
  More than 370 metrics across all experiments
• Metrics are gathered by collectors
  Refresh rate between 10 minutes and 7 days
• Presents the latest state and historical information
• Presents different views (a view is a set of metrics)
  More than 40 views
• Different Oracle databases per experiment
  CMS: 87 M entries, 20 collectors; ATLAS: 50 M entries, 3 collectors
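A sketch of the measurement record listed above and of a toy collector filling one; the class is illustrative and does not mirror the SSB schema.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Measurement:
    """One SSB measurement, following the fields listed on the slide."""
    site: str
    metric: str
    start_time: datetime
    end_time: datetime
    value: str
    color: str       # e.g. "green" or "red", as interpreted by the views
    url: str         # link to the detailed source of the value

def collect_job_efficiency(site, efficiency, details_url):
    """Toy collector producing one measurement for a 'job efficiency' metric."""
    now = datetime.utcnow()
    color = "green" if efficiency > 0.9 else "red"
    return Measurement(site, "job_efficiency", now, now,
                       "%.2f" % efficiency, color, details_url)
```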

Page 22:

SSB DATA FLOW JAN 2011

[Diagram: SSB collectors (Savannah, free text, BDII, job efficiency, topology) write directly to the latest-results and historical-data tables]

Page 23:

COLLECTOR IMPROVEMENTS

• Too many different writers (LOCKING)!
  Use temporary files and a single writer (see the sketch below)

• Huge table for historical values
  Partition by hash of metric and time

• Insertion rate too slow (1 second/entry)
  Avoid triggers and materialized views, and process as much as possible before insertion
  Monitor the insertion rate
  Now 20 ms per entry

• Thanks to the CERN DBAs for their support and suggestions
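A minimal sketch of the temporary-files-plus-single-writer pattern: each collector appends measurements to its own file, and one loader process reads the files and performs the DB insertions, so only a single session writes to the table. Paths and record formats are assumptions.

```python
import csv
import glob
import os

TMP_DIR = "/tmp/ssb_collectors"          # assumed location for the sketch

def write_measurements(collector_name, measurements):
    """Each collector only appends to its own temporary file (no DB locks)."""
    path = os.path.join(TMP_DIR, "%s.csv" % collector_name)
    with open(path, "a", newline="") as handle:
        csv.writer(handle).writerows(measurements)

def load_all(insert_rows):
    """The single writer: rename each temporary file, read it, insert its rows
    in one bulk operation and remove it."""
    for path in glob.glob(os.path.join(TMP_DIR, "*.csv")):
        in_progress = path + ".loading"
        os.rename(path, in_progress)      # collectors keep writing to a new file
        with open(in_progress, newline="") as handle:
            rows = list(csv.reader(handle))
        if rows:
            insert_rows(rows)             # e.g. a bulk executemany insert
        os.remove(in_progress)
```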

Page 24:

SSB DATA FLOW SEP 2011

[Diagram: SSB collectors (Savannah, free text, BDII, job efficiency, topology) write to temporary files; a single loader loads the data into the latest-results and historical-values tables]

Page 25:

UI IMPROVEMENTS

• Better graphics
• Filtering, sorting and pagination
• Exporting data
• Client-side plotting

Page 26:

CURRENT CHALLENGES (SSB)

• Steadily growing amount of data
  Aggregate? Decrease granularity for older values? NoSQL?

Page 27:

DASHBOARD DATA MINING (EXAMPLE)

• The exit code generated in case of job failure does not always allow identifying the cause of the problem

• A data mining technique called association rule mining was applied to the collected job monitoring data in order to identify the cause of job failures (see the sketch below)

• Within the Dashboard framework, the Quick Analysis Of Error Sources (QAOES) application was developed by a PhD student

• Logically two steps: identifying a problem and then providing previously collected human expertise about possible solutions to the detected problem. The information is merged and exposed through the UI

• The application ran for a year or so for CMS. It needed active evaluation and contributions from the experiment in order to make something really useful out of it; unfortunately, that did not happen
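To make the technique concrete, here is a toy sketch of association rule mining over job failure records: count how often combinations of attribute values (site, exit code, ...) co-occur with failures and keep the rules with sufficient support and confidence. This shows only the flavour of the approach, not the QAOES implementation.

```python
from collections import Counter
from itertools import combinations

def mine_failure_rules(jobs, min_support=2, min_confidence=0.8):
    """jobs: list of dicts with job attributes plus a boolean 'failed' flag.
    Returns (attribute pair, support, confidence) for rules 'pair => failure'."""
    pair_counts = Counter()
    fail_counts = Counter()
    for job in jobs:
        items = sorted((k, v) for k, v in job.items() if k != "failed")
        for pair in combinations(items, 2):
            pair_counts[pair] += 1
            if job["failed"]:
                fail_counts[pair] += 1
    rules = []
    for pair, total in pair_counts.items():
        confidence = fail_counts[pair] / float(total)
        if fail_counts[pair] >= min_support and confidence >= min_confidence:
            rules.append((pair, fail_counts[pair], confidence))
    return sorted(rules, key=lambda rule: -rule[2])

jobs = [
    {"site": "T2_A", "exit_code": "8020", "failed": True},
    {"site": "T2_A", "exit_code": "8020", "failed": True},
    {"site": "T2_B", "exit_code": "0", "failed": False},
]
for pair, support, confidence in mine_failure_rules(jobs):
    print(pair, support, confidence)
```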

Page 28:

DASHBOARD CLUSTER

• 50 machines
  16 physical, Quattor, PES control (4 SLC4)
  28 virtual, Quattor, PES control
  2 temporary virtual, PES control
  4 virtual, only for internal tests

• Used by the IT-ES group (not only Dashboard)
• Quattor templates for common components (iptables, certificates, web servers, yum)
  Every machine has some manual configuration

• MonALISA monitoring
  Host monitor, web server, collectors, rpm, alarms…
  Still to configure automatic actions

• Wiki describing typical actions:
  https://twiki.cern.ch/twiki/bin/view/ArdaGrid/Dashboard#Dashboard_Machines_Overview

Page 29:

DASHBOARD CLUSTER MONITORING DISPLAY

Page 30:

CMS DATA POPULARITY

• The CMS Popularity project is a monitoring service for the data access patterns
• Technology
  CRAB/Dashboard to collect the file-based information from the CMS user jobs running on the grid
  A Python daemon to harvest the above information and populate an Oracle database backend
  Oracle materialized views for daily data aggregation (see the sketch below)
  A web UI, developed using the Django web framework and jQuery, exposes the popularity metrics (historical views, aggregated views) as tables, plots and a JSON API
• Scale
  Collected information on more than 300 K files/day
  Harvesting time needed by the daemon: ~40'/day
  Refresh time of the materialized views: ~1'/day
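A toy sketch of the daily aggregation (done in production with Oracle materialized views): per-file access counts harvested from the job information are summed per dataset and day. The record layout is an assumption.

```python
from collections import defaultdict

def daily_popularity(file_accesses):
    """file_accesses: iterable of (date, dataset, filename, n_accesses) records
    harvested from the job monitoring information. Returns access counts per
    (date, dataset), the granularity the popularity views expose."""
    totals = defaultdict(int)
    for day, dataset, _filename, n_accesses in file_accesses:
        totals[(day, dataset)] += n_accesses
    return totals
```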


Page 31:

CMS SITE CLEANING AGENT

• CMS Site Cleaning Agent: implements the strategies to free up space at T2s

• Technology
  Python-based application
  CMS Popularity and PhEDEx information accessed via HTTP JSON APIs
  Results exposed via the CMS Popularity web UI

• Scale
  Runs once a day: processing time ~2 h
  Monitors the disk space of O(50) CMS T2 sites and O(20) physics groups, looking for sites/groups over quota (see the sketch below)
  O(200 K) data blocks checked per run
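A rough sketch of the over-quota check such an agent performs: compare used space per site/group against its quota and pick the least popular blocks as cleaning candidates. The data structures are invented for the example and are not the agent's actual interfaces.

```python
def cleaning_candidates(usage, quotas, block_popularity):
    """usage: {(site, group): used_bytes}; quotas: {(site, group): quota_bytes};
    block_popularity: {(site, group): [(block, accesses, size_bytes), ...]}.
    For each over-quota (site, group), return the least accessed blocks whose
    removal would bring the usage back under quota."""
    plan = {}
    for key, used in usage.items():
        quota = quotas.get(key)
        if quota is None or used <= quota:
            continue
        to_free = used - quota
        freed, selected = 0, []
        # Least popular blocks first.
        for block, accesses, size in sorted(block_popularity.get(key, []),
                                            key=lambda b: b[1]):
            if freed >= to_free:
                break
            selected.append(block)
            freed += size
        plan[key] = selected
    return plan
```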


Page 32:

HAMMER CLOUD

Technology:
- HC is a "Ganga application": a Python service which uses Ganga to submit grid jobs to gLite WMS, PanDA, CRAB, and DIRAC backends.
- State is recorded in a MySQL database; there are plans to develop an Oracle backend.
- HC provides a Django-based web frontend, developed with JSON/jQuery UI elements.

Scale:
- HC runs three instances, for ATLAS, CMS, and LHCb, with ~60 user accounts (mostly grid site administrators).
- Total of ~10-20,000 jobs per day. Each job is very short -- just to test the basic grid analysis workflows.
- History is kept for all test jobs -- currently the DBs contain records for ~30 million jobs.

Page 33:

USING SLS FOR ATLAS DISTRIBUTED COMPUTING (ADC)

• In ADC, 10 critical services are monitored. Each service has between 1 and 10 service instances. Metrics to calculate availability are gathered using Lemon, Webalizer and service-specific reports.

• In addition, ADC has an SLS-based T1 storage space monitoring (around 40 space tokens). Storage space information is retrieved using lcg-utils. LHCb has a very similar implementation.

• The information is monitored by ADC shifters, who are instructed to report immediately to the ATLAS Manager on Duty in case a service is degraded.

Page 34:

CORAL APPLICATION MONITORING

• Monitor DB connections, queries and transactions performed by a CORAL client application
  Fixing and enhancing this feature (it has existed in CORAL for a long time but was never really used)
  CMS wants to use it with the Oracle and Frontier plugins
  CORAL code is internally instrumented to keep track of DB operations and dump them when the client job ends

• Also integrating this feature in the CORAL server
  ATLAS wants to use it to monitor DB operations in the HLT
  Keep track of DB operations executed via the CORAL server and make them available in real time while the server is up
  Eventually we would also like to monitor packet traffic through the hierarchy of CORAL server proxies caching HLT data


Page 35:

FRONTIER AND SQUID MONITORING

• The activity is performed within ATLAS
  With help from Frontier experts in CMS and in contact with the CORAL team in IT-ES

• The aim is to provide service availability monitoring for the ATLAS distributed Frontier/Squid deployment
  Probing Squids via MRTG (shown per individual node on frontier.cern.ch) – based on BDII, being moved to AGIS
  Probing Frontier via ping (shown for the service only on SLS)
  Grepping Frontier server logs (AWSTATS) – operational at some sites like CERN and BNL, being deployed elsewhere
