1 DIRAC – LHCb MC production system A.Tsaregorodtsev, CPPM, Marseille For the LHCb Data Management...

1

DIRAC – LHCb MC production system

A.Tsaregorodtsev, CPPM, Marseille

For the LHCb Data Management team

CHEP, La Jolla, 25 March 2003

2

Outline

Introduction
DIRAC architecture
Implementation details
Deploying DIRAC on the DataGRID
Conclusions

3

What is it all about?

Distributed MC production system for LHCb:
  Production task definition and steering;
  Software installation on production sites;
  Job scheduling and monitoring;
  Data transfers and bookkeeping.

Automates most of the production tasks:
  Minimum participation of local production managers.

PULL rather than PUSH concept for job scheduling.

DIRAC – Distributed Infrastructure with Remote Agent Control

4

DIRAC architecture

[Diagram: a central Production service, Monitoring service and Bookkeeping service, with SW agents running at Sites A, B, C and D. Labelled links: "Get jobs", "Monitoring info", "Bookkeeping data".]

5

Advantages of the PULL approach

Better use of resources:
  no idle or forgotten CPU power;
  natural load balancing: a more powerful center automatically gets more work.

Less burden on the central production service:
  it deals only with production task definition and bookkeeping;
  it does not need to know about particular production sites.

No direct access to local disks from the central service.

Easy introduction of new sites into the production system:
  no information on local sites is necessary at the central site.
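
The load-balancing claim can be illustrated with a small simulation. This is only a sketch under stated assumptions: the site names, speeds and job sizes are made up, and the function `simulate_pull` is hypothetical, not part of DIRAC. Each site asks the central queue for a new job as soon as it is free, so a faster site ends up with more jobs without the central service knowing anything about it.

```python
import heapq
from collections import Counter


def simulate_pull(job_sizes, site_speeds):
    """Simulate sites pulling jobs from one central queue.

    job_sizes: list of job sizes in arbitrary work units.
    site_speeds: dict mapping site name to work units processed per time unit.
    Returns a Counter with the number of jobs each site has taken.
    """
    pending = list(reversed(job_sizes))  # pop() then serves jobs in the given order
    taken = Counter()
    # Heap of (time at which the site becomes free, site name); all free at t = 0.
    free_at = [(0.0, name) for name in sorted(site_speeds)]
    heapq.heapify(free_at)
    while pending:
        now, site = heapq.heappop(free_at)  # the next site to become free pulls a job
        work = pending.pop()
        taken[site] += 1
        heapq.heappush(free_at, (now + work / site_speeds[site], site))
    return taken


# Hypothetical setup: SiteA is four times faster than SiteB.
counts = simulate_pull([1.0] * 100, {"SiteA": 4.0, "SiteB": 1.0})
print(counts)
```

The central queue never looks at site capacity; the imbalance in `counts` comes only from how often each site asks for work.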

6

Job description

[Diagram: Workflow description (application steps with versions: Pythia v2, Gauss v5, GenTag v7, Brunel v12) + Production run description (event type, application options, number of events, execution mode, destination site, …) give the XML job descriptions. Other labels: Production manager, Web based editors, Production DB.]
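
As an illustration of how a workflow description and a production run description might be combined into one XML job description, here is a minimal sketch. The slide does not show the actual DIRAC schema, so the element and attribute names (`job`, `workflow`, `step`, `run`, `parameter`), the function `build_job_xml` and the parameter values are all assumptions.

```python
import xml.etree.ElementTree as ET


def build_job_xml(steps, run_parameters):
    """Return an XML job description as a string.

    steps: list of (application, version) pairs, in execution order.
    run_parameters: dict of production run settings.
    The element names used here are hypothetical, not the DIRAC schema.
    """
    job = ET.Element("job")
    workflow = ET.SubElement(job, "workflow")
    for order, (application, version) in enumerate(steps, start=1):
        ET.SubElement(workflow, "step", order=str(order),
                      application=application, version=version)
    run = ET.SubElement(job, "run")
    for name, value in run_parameters.items():
        ET.SubElement(run, "parameter", name=name).text = str(value)
    return ET.tostring(job, encoding="unicode")


# Application names and versions are taken from the slide; the run values are made up.
job_xml = build_job_xml(
    steps=[("Gauss", "v5"), ("Brunel", "v12")],
    run_parameters={"number_of_events": 500, "execution_mode": "batch"},
)
print(job_xml)
```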

7

Agent operations

[Sequence diagram. Participants: Production agent, batch system, Production service, SW distribution service, Monitoring service, Bookkeeping service, MassStorage, running job. Calls, in order:]
  isQueueAvalable()
  requestJob(queue)
  installPackage()
  submitJob(queue)
  setJobStatus(step 1)
  setJobStatus(step 2)
  …
  setJobStatus(step n)
  sendBookkeeping()
  sendFileToCastor()
  addReplica()
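
The call sequence on this slide can be sketched as one agent cycle. The method names are copied from the slide (including the spelling `isQueueAvalable`). Everything else is an assumption made for illustration: which service receives which call, the arguments, the job dictionary, and the `RecordingService` stand-in that replaces the real services.

```python
class RecordingService:
    """Hypothetical stand-in for a service: logs each call and returns a canned reply."""

    def __init__(self, name, log, replies=None):
        self.name = name
        self.log = log
        self.replies = replies or {}

    def __getattr__(self, method):
        # Only reached for names not set in __init__, i.e. the service calls.
        def call(*args):
            self.log.append(f"{self.name}.{method}")
            return self.replies.get(method)
        return call


def run_agent_cycle(batch, production, software, monitoring, bookkeeping, storage, queue):
    """One PULL cycle: ask for work only when the local queue has room."""
    if not batch.isQueueAvalable():
        return None
    job = production.requestJob(queue)
    if not job:
        return None
    software.installPackage(job["package"])
    batch.submitJob(queue)
    # In a real deployment the calls below would be made by the running job.
    for step in job["steps"]:
        monitoring.setJobStatus(step)
    bookkeeping.sendBookkeeping()
    storage.sendFileToCastor(job["output"])
    bookkeeping.addReplica(job["output"])
    return job["id"]


log = []
example_job = {"id": 1, "package": "Gauss v5",
               "steps": ["step 1", "step 2"], "output": "events.dat"}
job_id = run_agent_cycle(
    batch=RecordingService("batch", log, {"isQueueAvalable": True}),
    production=RecordingService("production", log, {"requestJob": example_job}),
    software=RecordingService("software", log),
    monitoring=RecordingService("monitoring", log),
    bookkeeping=RecordingService("bookkeeping", log),
    storage=RecordingService("storage", log),
    queue="short",
)
print(job_id, log)
```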

8

Implementation details

Central web services:
  XML-RPC servers;
  Web based editing and visualization;
  ORACLE production and bookkeeping databases.

Agent: a set of collaborating Python classes:
  Python 1.5.2, to be sure it is compatible with all the sites;
  standard Python library XML-RPC client;
  The agent runs as a daemon process or as a cron job on a production site.

Easily extendable via plugins:
  • for new applications;
  • for new tools, e.g. file transport.

Data and log file transfers use bbftp.
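
To show what an XML-RPC exchange between an agent and a central service looks like on the wire, here is a small sketch using the `xmlrpc.client` module of current Python 3 (the slide refers to Python 1.5.2, so this is not the original code). The method name `requestJob` comes from the agent operations slide; the argument `"short"` and the reply `"job-1.xml"` are made-up values.

```python
import xmlrpc.client

# What the agent would send: a call to requestJob with a queue name.
request_xml = xmlrpc.client.dumps(("short",), methodname="requestJob")

# What the service sees after decoding the request.
params, method = xmlrpc.client.loads(request_xml)

# What the service would send back: a response carrying one value.
response_xml = xmlrpc.client.dumps(("job-1.xml",), methodresponse=True)

# What the agent gets after decoding the response.
(result,), _ = xmlrpc.client.loads(response_xml)

print(request_xml)
print(method, params, result)
```

Against a live server, `xmlrpc.client.ServerProxy(url).requestJob("short")` would perform the same encoding and decoding over HTTP.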

9

Agent customization at a production site

Easy setting up of a production site is crucial to absorb all available resources.

One Python script where all the local configuration is defined:
  Interface to the local batch system;
  Interface to the local mass storage system.

Agent distribution comes with examples of typical cases:
  A “standard” site can be configured in a few minutes
  • e.g., PBS + disk mass storage.
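
A site configuration script of this kind might look like the sketch below for the PBS + disk case. The class names, method names and the `SITE` dictionary are hypothetical; the slide does not show the real DIRAC interfaces. The only external fact used is that PBS jobs are submitted with `qsub -q <queue> <script>`. The batch class only builds the command and does not run it.

```python
import shutil
import tempfile
from pathlib import Path


class PBSBatch:
    """Hypothetical interface to a local PBS batch system."""

    def __init__(self, queue):
        self.queue = queue

    def submit_command(self, script):
        # Command line an agent could hand to subprocess.run().
        return ["qsub", "-q", self.queue, str(script)]


class DiskStorage:
    """Hypothetical mass storage interface backed by a plain directory."""

    def __init__(self, root):
        self.root = Path(root)

    def store(self, path):
        self.root.mkdir(parents=True, exist_ok=True)
        return Path(shutil.copy(path, self.root))


# Demonstration in a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as workdir:
    SITE = {
        "batch": PBSBatch(queue="short"),
        "storage": DiskStorage(Path(workdir) / "store"),
    }
    output = Path(workdir) / "events.dat"
    output.write_text("dummy data")
    command = SITE["batch"].submit_command("job.sh")
    stored = SITE["storage"].store(output)
    stored_ok = stored.exists() and stored.read_text() == "dummy data"

print(command, stored_ok)
```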

10

Dealing with failures

Rescheduling:
  a job is rescheduled if the local system fails to run it;
  other sites can then pick it up.

Journaling:
  all the sensitive files (logs, bookkeeping, job descriptions) are kept in caches at the production site.

A job can be restarted from where it failed:
  completed steps are not redone.

File transfers are automatically retried after a predefined pause in case of failure.
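
The two recovery mechanisms on this slide, restarting without redoing completed steps and retrying after a pause, can be sketched as follows. This is an illustration only: the JSON journal file, the function names `run_steps` and `retry_with_pause`, and the use of `OSError` as the failure signal are assumptions, not the DIRAC implementation.

```python
import json
import tempfile
import time
from pathlib import Path


def run_steps(steps, journal_path):
    """Run (name, action) steps in order, skipping those already journaled."""
    journal = Path(journal_path)
    done = json.loads(journal.read_text()) if journal.exists() else []
    for name, action in steps:
        if name in done:
            continue  # completed in an earlier run, not redone
        action()
        done.append(name)
        journal.write_text(json.dumps(done))  # record progress after each step
    return done


def retry_with_pause(action, attempts, pause_seconds, sleep=time.sleep):
    """Call action until it succeeds, pausing between failed attempts."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise
            sleep(pause_seconds)


# Restart demo: the second step fails once, then the job is run again.
executed = []
failures_left = {"upload": 1}


def simulate():
    executed.append("simulate")


def upload():
    if failures_left["upload"] > 0:
        failures_left["upload"] -= 1
        raise OSError("transfer failed")
    executed.append("upload")


with tempfile.TemporaryDirectory() as workdir:
    journal_file = Path(workdir) / "journal.json"
    steps = [("simulate", simulate), ("upload", upload)]
    try:
        run_steps(steps, journal_file)
    except OSError:
        pass
    finished = run_steps(steps, journal_file)

# Retry demo: pauses are recorded instead of actually sleeping.
pauses = []
outcomes = iter([OSError("busy"), OSError("busy"), "sent"])


def flaky_transfer():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


status = retry_with_pause(flaky_transfer, attempts=3, pause_seconds=30, sleep=pauses.append)
print(executed, finished, status, pauses)
```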

11

Working experience

DIRAC production system was deployed on 17 LHCb production sites:
  2 hours to 2 days of work for customization.

Smooth running for MC production tasks.

Much less burden for local production managers:
  automatic data upload to CERN/Castor;
  log files automatically available through a Web page;
  automatic recoveries from common failures (job submission, data transfers).

The current Data Challenge production using DIRAC is advancing ahead of schedule:
  ~1000 CPUs used in total;
  1M events produced per day.

12

DIRAC on the DataGRID

[Diagram. Components shown: Production service, Monitoring service, Bookkeeping service, Castor, DataGRID portal, Resource Broker, worker nodes (WN), Replica manager, Replica catalog, CERN SE. Items passed to the DataGRID: job.xml, JDL.]

13

Deploying agents on the DataGRID

INPUT: JDL InputSandbox contains:
  job XML description;
  agent launcher script:

    > wget 'http://…/distribution/dmsetup'
    > dmsetup --local DataGRID
    > shoot_agent job.xml

OUTPUT:
  Use EDG replica_manager for data transfer to CERN SE/Castor;
  Log files are passed back via OutputSandbox.

14

Tests on the DataGRID testbed

Standard LHCb production jobs were used for the tests:
  Jobs of different statistics, each with an 8-step workflow.

Jobs submitted to 4 EDG testbed Resource Brokers:
  keeping ~50 jobs per broker;
  Software installed for each job.

Job type (hours)   Total   Success   Success rate
Mini (0.2)           190       113            59%
Short (6)            171       102            59%
Medium (24)         1195       346            29%
Total               1556       561            36%
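
The totals and success rates in the table can be rechecked with a few lines of arithmetic. The job counts are those on the slide; the variable names are only for this sketch.

```python
# (total jobs, successful jobs) per job type, as given in the table.
results = {
    "Mini": (190, 113),
    "Short": (171, 102),
    "Medium": (1195, 346),
}

total_jobs = sum(total for total, _ in results.values())
total_success = sum(success for _, success in results.values())

# Success rate in percent, to one decimal place.
rates = {name: round(100.0 * success / total, 1)
         for name, (total, success) in results.items()}
overall_rate = round(100.0 * total_success / total_jobs, 1)

print(total_jobs, total_success, rates, overall_rate)
```

To one decimal place the rates are 59.5, 59.6, 29.0 and 36.1 percent, so the whole-percent figures in the table are consistent with the counts, with the Short rate apparently truncated rather than rounded.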

Total of ~300K events produced so far.
This already makes the EDG testbed a competitive LHCb production site.

15

Main problems

EDG middleware instability problems:
  MDS information system failures – “no matching resources found”;
  RB fails to get input files because of gridftp failures;
  Jobs stuck in some unfinished state:
  • “Done”, “Resubmitted”, etc.

Long jobs suffering from site misconfiguration:
  RB fails to find appropriate resources;
  Jobs hit the limits of the local batch system;
  “Estimated Traversal Time” fails as a ranking criterion.

Software installation failures:
  Disk quotas;
  Forbidden outbound IP connections on WNs at some sites.

16

Some lessons learnt

An API for software installation is needed:
  For experiments to install software:
  • independently from site managers;
  • on a per-job basis if necessary.
  For site managers to be sure the software is installed in an organized way.

Outbound IP connectivity should be available:
  Needed for the software installation;
  Needed for jobs exchanging messages with production services.

Uniform site descriptions:
  EDG uniform CPU unit?

17

Conclusions

The DIRAC production system is routinely running in production now at ~17 sites ;

The PULL paradigm for job scheduling proved to be very successful;

It is of great help for local production managers and a key to the success of the LHCb Data Challenge 2003;

The DataGRID testbed is integrated in the DIRAC production system; extensive tests are in progress.
