A lightweight Monitoring and Accounting system for LHCb DC'04 production


Page 1: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


A lightweight Monitoring and Accounting system for LHCb DC'04 production

V. Garonne, R. Graciani Díaz, J. J. Saborido Silva, M. Sánchez García, R. Vizcaya Carrillo

Page 2: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Outline

Manifesto
Monitoring
  Web interface
  Internals
Accounting
  Web interface
  Internals
Outlook
URLs

Page 3: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Manifesto

Monitoring and Accounting are tasks in DIRAC [377]
  DIRAC is a production grid for LHCb
The Monitoring reports the status of jobs while in the WMS (Workload Management System) [366]
  Instantaneous snapshot of the system
  No historic records
The Accounting records the status of jobs after leaving the WMS
  Provides historic record, accumulated statistics and evolution of recorded variables with time
Main users: production and site managers

Page 4: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Design choices

Monitoring
  Job information stored centrally in the WMS
  Information provided directly by the job and the WMS
  Passive services: no pushing of information
  No need for a common consumer API
  Job and Application state stored together
Accounting
  Separate infrastructure from the monitoring
  Jobs are never in both the Accounting and the Monitoring
  Domain specific: LHCb production jobs

Page 5: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Information Flow

[Diagram: information flow in DIRAC. Components shown: Users, Job, WMS, Backend Services & Agents, Job Database, Accounting Database, Cleaner Agent, and the Monitoring and Accounting web interfaces. Labelled flows: Job Heart-beat, Read, Write.]

Page 6: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Monitoring Web Interface 1
Interface to query the monitoring service
Clicking a JobId pops up a window with the job details

Page 7: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Monitoring Web Interface 2
The overview shows predefined plots on the production
  Generated every few minutes
PyChart used as graphics engine
  100% Python
  Supports SVG
[Plot: Running jobs by site]

Page 8: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Monitoring Web Interface 3
[Plot: Job status by site and production id]

Page 9: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Monitoring Internals

It consists of an XML-RPC service exposing whatever parameters are known to DIRAC
Job parameters stored internally by DIRAC
Primary parameters
  Execution site, job status, job owner, etc.
  Fixed, centrally defined: fast access
  Can query on them
Secondary parameters
  Number of steps, internal job state, etc.
  Defined by the production job itself
  Stored as key-value pairs
  Slower access
  Cannot query on them
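The split between primary and secondary parameters can be pictured with a small relational sketch. This is only an illustration under assumptions: the slides do not describe the actual DIRAC storage, so the table names, column names, and helper functions below are hypothetical. Primary parameters sit in fixed, indexed columns and can be used in selections; secondary parameters are stored as key-value rows and are only fetched per job.

```python
import sqlite3

# Hypothetical schema: names are illustrative, not taken from DIRAC.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE jobs (              -- primary parameters: fixed columns
        job_id INTEGER PRIMARY KEY,
        site   TEXT NOT NULL,
        status TEXT NOT NULL,
        owner  TEXT NOT NULL
    );
    CREATE INDEX jobs_site_status ON jobs (site, status);
    CREATE TABLE job_parameters (    -- secondary parameters: key-value pairs
        job_id INTEGER NOT NULL REFERENCES jobs (job_id),
        name   TEXT NOT NULL,
        value  TEXT NOT NULL,
        PRIMARY KEY (job_id, name)
    );
""")

PRIMARY = {"site", "status", "owner"}


def add_job(job_id, site, status, owner, **secondary):
    """Store a job with its primary columns and any secondary key-value pairs."""
    conn.execute("INSERT INTO jobs VALUES (?, ?, ?, ?)",
                 (job_id, site, status, owner))
    conn.executemany("INSERT INTO job_parameters VALUES (?, ?, ?)",
                     [(job_id, k, str(v)) for k, v in secondary.items()])


def select_jobs(conditions):
    """Select job ids by primary parameters only; anything else is rejected."""
    unknown = set(conditions) - PRIMARY
    if unknown:
        raise KeyError("cannot query on: %s" % ", ".join(sorted(unknown)))
    where = " AND ".join(f"{col} = ?" for col in conditions)
    sql = "SELECT job_id FROM jobs"
    if where:
        sql += " WHERE " + where
    sql += " ORDER BY job_id"
    return [row[0] for row in conn.execute(sql, tuple(conditions.values()))]


def get_parameter(job_id, name):
    """Fetch one secondary parameter of one job, or None if it is not set."""
    row = conn.execute(
        "SELECT value FROM job_parameters WHERE job_id = ? AND name = ?",
        (job_id, name)).fetchone()
    return row[0] if row else None


# Made-up sample data.
add_job(1, "DIRAC.CERN.ch", "running", "prod", LocalBatchId="12345", Steps=3)
add_job(2, "DIRAC.CERN.ch", "done", "prod", Steps=5)
running_at_cern = select_jobs({"site": "DIRAC.CERN.ch", "status": "running"})
```

In this sketch a selection on a secondary parameter raises an error instead of scanning the key-value table, which mirrors the "cannot query on them" rule stated on the slide.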

Page 10: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


JMS basic API example

    from xmlrpclib import ServerProxy

    # monitoring_url is the URL of the monitoring service (not given on the slide)
    server = ServerProxy(monitoring_url)

    # Retrieve list of jobs verifying some conditions
    conditions = {'Status': 'running', 'Site': 'DIRAC.CERN.ch'}
    jobreq = server.getJobs(conditions)

    # Print some parameters for each job
    if jobreq['Status']:
        for jobid in jobreq['Value']:
            print server.getJobSite(jobid)
            print server.getJobParameter(jobid, 'LocalBatchId')

    # Bulk operations (result renamed from "sum" to avoid shadowing the builtin)
    summary = server.getJobsPrimarySummary(jobreq['Value'])

Timings quoted on the slide: ~3 s to select 95 out of 50k jobs; ~0.7 s; ~40 s
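To try the client logic without access to a monitoring server, an in-process stand-in can be useful. The sketch below is written for Python 3 and is not the DIRAC service: only the method names and the 'Status'/'Value' shape of the getJobs reply come from the slide. The reply shapes of getJobSite and getJobsPrimarySummary, the call counter, the site name DIRAC.Example.org, and the sample jobs are assumptions made for illustration. The counter only shows how many round trips each access pattern needs and says nothing about the timings quoted on the slide.

```python
class FakeMonitoring:
    """In-process stand-in for the monitoring service (illustrative only)."""

    def __init__(self, jobs):
        self._jobs = jobs      # {job_id: {"Status": ..., "Site": ...}}
        self.calls = 0         # number of simulated remote calls

    def _ok(self, value):
        # Assumed reply shape, modelled on the getJobs reply used on the slide.
        self.calls += 1
        return {"Status": True, "Value": value}

    def getJobs(self, conditions):
        ids = sorted(
            job_id for job_id, params in self._jobs.items()
            if all(params.get(k) == v for k, v in conditions.items())
        )
        return self._ok(ids)

    def getJobSite(self, jobid):
        return self._ok(self._jobs[jobid]["Site"])

    def getJobsPrimarySummary(self, jobids):
        # XML-RPC struct keys must be strings, so job ids are stringified.
        return self._ok({str(j): dict(self._jobs[j]) for j in jobids})


# Made-up sample jobs.
server = FakeMonitoring({
    1: {"Status": "running", "Site": "DIRAC.CERN.ch"},
    2: {"Status": "done", "Site": "DIRAC.CERN.ch"},
    3: {"Status": "running", "Site": "DIRAC.CERN.ch"},
    4: {"Status": "running", "Site": "DIRAC.Example.org"},
})

jobreq = server.getJobs({"Status": "running", "Site": "DIRAC.CERN.ch"})
sites = []
summary = {}
if jobreq["Status"]:
    # One call per job ...
    sites = [server.getJobSite(jobid)["Value"] for jobid in jobreq["Value"]]
    # ... versus a single bulk call for the whole selection.
    summary = server.getJobsPrimarySummary(jobreq["Value"])["Value"]
```

With a real server, `FakeMonitoring(...)` would be replaced by `xmlrpc.client.ServerProxy(monitoring_url)`, the Python 3 counterpart of the `xmlrpclib.ServerProxy` used on the slide.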

Page 11: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Accounting Web Interface 1
GUI for querying the Accounting
Shows results
  As graphics
  As table
  As Excel sheet
Several types of report
  Only a few shown here

Page 12: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Accounting Web Interface 2

Used resources by site

Page 13: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Accounting Web Interface 3

Used resources by event type
  Mb/job
  CPU/job
  Failed jobs
  CPU vs. Exec time
  Input and Output data vs. CPU

Page 14: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Accounting Web Interface 4

Produced data by production ID
  Rates
  Cumulative
  Number of events
  Gb of output

Page 15: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Accounting Web Interface 5

WMS statistics on DIRAC's performance
Plots
  Job execution time vs. WMS waiting time
  Job execution time vs. WMS matching time
Granularity
  Per site
  Per production
  Integral
Allows assessment of DIRAC's performance
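The three granularities listed above amount to grouping the same job records by a different key. A minimal sketch, assuming each accounting record carries a site, a production id, an execution time and a WMS waiting time; the field names, the site name DIRAC.Example.org, and the sample values are made up for illustration and are not taken from the accounting database.

```python
from collections import defaultdict

# Made-up records; field names are hypothetical.
records = [
    {"site": "DIRAC.CERN.ch", "production": 100, "exec_s": 3600.0, "wait_s": 120.0},
    {"site": "DIRAC.CERN.ch", "production": 101, "exec_s": 1800.0, "wait_s": 60.0},
    {"site": "DIRAC.Example.org", "production": 100, "exec_s": 7200.0, "wait_s": 300.0},
]


def summarize(records, key=None):
    """Sum execution and waiting time per group.

    key="site" or key="production" groups by that field;
    key=None puts every record in one group (the integral view).
    """
    totals = defaultdict(lambda: {"jobs": 0, "exec_s": 0.0, "wait_s": 0.0})
    for rec in records:
        group = rec[key] if key else "all"
        entry = totals[group]
        entry["jobs"] += 1
        entry["exec_s"] += rec["exec_s"]
        entry["wait_s"] += rec["wait_s"]
    return dict(totals)


per_site = summarize(records, "site")
per_production = summarize(records, "production")
integral = summarize(records)
```

Each summary holds the paired totals needed for an execution-time versus waiting-time comparison at that granularity.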

Page 16: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Accounting Internals

Job and DIRAC statistics kept in a database
  Site contribution
  Data produced and used by jobs and steps
  Timing for jobs, steps and DIRAC internals
Separate XML-RPC interfaces to populate and query the accounting tables
  Both interfaces have restricted access
Jobs are moved to the accounting system by a cleaner agent after being validated
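The cleaner-agent step can be sketched as a validate-then-move loop. The slides do not say how the real agent validates or transfers jobs, so everything here is an assumption for illustration: two SQLite databases attached to one connection stand in for the job and accounting databases, the schema is hypothetical, and the validation rule (a final state plus a non-negative CPU time) is invented. The insert and the delete run in one transaction, so a job is never left in both databases.

```python
import sqlite3

# One connection with a second in-memory database attached, so that a
# single transaction can cover both the job and the accounting tables.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS accounting")
conn.executescript("""
    CREATE TABLE main.jobs (
        job_id INTEGER PRIMARY KEY, site TEXT, status TEXT, cpu_s REAL);
    CREATE TABLE accounting.jobs (
        job_id INTEGER PRIMARY KEY, site TEXT, status TEXT, cpu_s REAL);
""")

FINAL_STATES = ("done", "failed")   # hypothetical final job states


def is_valid(row):
    """Hypothetical validation rule: final state and a sane CPU time."""
    _job_id, _site, status, cpu_s = row
    return status in FINAL_STATES and cpu_s is not None and cpu_s >= 0


def clean(conn):
    """Move every validated job from the job table to the accounting table."""
    moved = []
    with conn:  # commits on success, rolls back if any statement fails
        rows = conn.execute(
            "SELECT job_id, site, status, cpu_s FROM main.jobs ORDER BY job_id"
        ).fetchall()
        for row in rows:
            if is_valid(row):
                conn.execute(
                    "INSERT INTO accounting.jobs VALUES (?, ?, ?, ?)", row)
                conn.execute(
                    "DELETE FROM main.jobs WHERE job_id = ?", (row[0],))
                moved.append(row[0])
    return moved


# Made-up sample jobs: two finished, one still running.
with conn:
    conn.executemany("INSERT INTO main.jobs VALUES (?, ?, ?, ?)", [
        (1, "DIRAC.CERN.ch", "done", 100.0),
        (2, "DIRAC.CERN.ch", "running", None),
        (3, "DIRAC.CERN.ch", "failed", 5.0),
    ])

moved = clean(conn)
```

After the call, job 2 is still visible to the monitoring side and jobs 1 and 3 exist only in the accounting table.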

Page 17: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Accounting Usage
About 10 hits per day
Time to generate daily static reports: 8 min
  60-70% of the time querying the database
  30-40% of the time in the drawing package
Server load < 0.2

Total: 169 kjobs

Page 18: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


Outlook

Monitoring page
  Transactions in monitoring updates
  Further optimisation (bulk operations...)
  Search for a faster rendering package
  Make the web page dynamic: fewer reloads
Accounting
  New report types
  Normalized CPU
  Contribution by country
  Rate by site, country, etc.

Page 19: A lightweight Monitoring and  Accounting system for LHCb DC'04 production


URLs

Monitoring page
  http://fpegaes1.usc.es/dmon/DC04/joblist.html
  Mirror on: http://lhcb02.usc.cesga.es/dmon/DC04/joblist.html
  Direct link to overview pages: http://lhcb.ecm.ub.es/DC04/Monitoring
Accounting page
  http://lhcb.ecm.ub.es/DC04/Accounting/