1
DIRAC – LHCb MC production system
A.Tsaregorodtsev, CPPM, Marseille
For the LHCb Data Management team
CHEP, La Jolla, 25 March 2003
2
Outline
Introduction
DIRAC architecture
Implementation details
Deploying DIRAC on the DataGRID
Conclusions
3
What is it all about?
Distributed MC production system for LHCb:
  • production task definition and steering;
  • software installation on production sites;
  • job scheduling and monitoring;
  • data transfers and bookkeeping.
Automates most of the production tasks:
  • minimal participation of local production managers.
PULL rather than PUSH concept for job scheduling.
DIRAC – Distributed Infrastructure with Remote Agent Control
4
DIRAC architecture
[Diagram: central Production, Monitoring and Bookkeeping services; a SW agent runs at each of Sites A, B, C and D. The agents get jobs from the central services and send back monitoring info and bookkeeping data.]
5
Advantages of the PULL approach
Better use of resources:
  • no idle or forgotten CPU power;
  • natural load balancing: a more powerful center automatically gets more work.
Less burden on the central production service:
  • it deals only with production task definition and bookkeeping;
  • it does not need to know about particular production sites.
No direct access to local disks from the central service.
Easy introduction of new sites into the production system:
  • no information on local sites is necessary at the central site.
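The load-balancing point can be illustrated with a minimal sketch. It is not DIRAC code: the site names, slot counts and the `run_pull_cycle` helper are assumptions made up for the illustration. Each site asks for only as many jobs as it has free slots, so the larger site ends up with more work without the central queue knowing anything about the sites.

```python
from collections import deque


def run_pull_cycle(job_queue, free_slots):
    """One polling round: every site pulls at most as many jobs as it has free slots."""
    assigned = {site: [] for site in free_slots}
    for site, slots in free_slots.items():
        for _ in range(slots):
            if not job_queue:
                break  # central queue is empty, nothing left to pull
            assigned[site].append(job_queue.popleft())
    return assigned


# Hypothetical central queue and hypothetical sites of different sizes.
jobs = deque(f"job-{i}" for i in range(10))
free_slots = {"SiteA": 6, "SiteB": 2, "SiteC": 1}

assigned = run_pull_cycle(jobs, free_slots)
counts = {site: len(pulled) for site, pulled in assigned.items()}
print(counts, "left in queue:", len(jobs))
```

The central queue only hands out jobs on request; the number of free slots never leaves the site.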
6
Job description
[Diagram: the production manager uses Web based editors to prepare a workflow description and a production run description (event type, application options, number of events, execution mode, destination site, …). Together they give the XML job descriptions kept in the Production DB. The example workflows are built from the steps Pythia v2, GenTag v7, Gauss v5 and Brunel v12.]
7
Agent operations
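[Sequence diagram: the Production agent and the running job interact with the local batch system and the central services.]
Production agent → batch system: isQueueAvalable()
Production agent → Production service: requestJob(queue)
Production agent → SW distribution service: installPackage()
Production agent → batch system: submitJob(queue)
Running job → Monitoring service: setJobStatus(step 1), setJobStatus(step 2), …, setJobStatus(step n)
Running job → Bookkeeping service: sendBookkeeping()
Running job → MassStorage: sendFileToCastor()
Running job → Bookkeeping service: addReplica()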
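The first half of this sequence (check the queue, request a job, install software, submit) might look roughly like the sketch below. It uses in-memory stand-ins rather than the real services; the class names, the snake_case method names and the job dictionary layout are all assumptions chosen for the example, loosely mirroring the calls named on the slide.

```python
class FakeProductionService:
    """Stand-in for the central Production service (hypothetical interface)."""

    def __init__(self, jobs):
        self.jobs = list(jobs)

    def request_job(self, queue):
        # Hand out the next waiting job, or None when nothing is waiting.
        return self.jobs.pop(0) if self.jobs else None


class FakeBatchSystem:
    """Stand-in for the local batch system (hypothetical interface)."""

    def __init__(self, free_slots):
        self.free_slots = free_slots
        self.submitted = []

    def is_queue_available(self, queue):
        return self.free_slots > 0

    def submit_job(self, queue, job):
        self.free_slots -= 1
        self.submitted.append((queue, job["id"]))


def agent_cycle(production, batch, install_package, queue):
    """Run one agent cycle; return the id of the submitted job, or None."""
    if not batch.is_queue_available(queue):
        return None  # no free capacity: do not ask for work at all
    job = production.request_job(queue)
    if job is None:
        return None  # the central service has nothing to hand out
    for package in job["software"]:
        install_package(package)
    batch.submit_job(queue, job)
    return job["id"]


installed = []
production = FakeProductionService([
    {"id": 1, "software": ["Gauss v5", "Brunel v12"]},
    {"id": 2, "software": ["Gauss v5"]},
])
batch = FakeBatchSystem(free_slots=1)

first = agent_cycle(production, batch, installed.append, "long")
second = agent_cycle(production, batch, installed.append, "long")
print(first, second, installed)
```

In the second cycle the queue is full, so the agent never contacts the production service and job 2 stays available for another site.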
8
Implementation details
Central web services:
  • XML-RPC servers;
  • Web based editing and visualization;
  • ORACLE production and bookkeeping databases.
Agent: a set of collaborating Python classes:
  • Python 1.5.2, to be sure it is compatible with all the sites;
  • standard Python library XML-RPC client;
  • the agent runs as a daemon process or as a cron job on a production site;
  • easily extendable via plugins:
    • for new applications;
    • for new tools, e.g. file transport.
Data and log file transfer using bbftp.
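As an illustration of the XML-RPC pattern, the sketch below starts a tiny server and calls it from a client in the same process. It uses the XML-RPC modules of the modern Python 3 standard library, not the Python 1.5.2 setup described above, and the `requestJob` handler with its return value is a made-up stand-in, not the real Production service interface.

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def request_job(queue):
    """Hypothetical handler: return a job description for the given queue."""
    return {"id": 42, "queue": queue}


# Bind to port 0 so the operating system picks a free local port.
server = SimpleXMLRPCServer(("127.0.0.1", 0), logRequests=False)
server.register_function(request_job, "requestJob")
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# The agent side: a plain XML-RPC client call over HTTP.
proxy = ServerProxy(f"http://127.0.0.1:{port}")
job = proxy.requestJob("long")
print(job)

server.shutdown()
server.server_close()
```

Because the agent only makes outgoing HTTP requests, the central service never needs to reach into the production site.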
9
Agent customization at a production site
Easy setup of a production site is crucial to absorb all available resources.
One Python script where all the local configuration is defined:
  • interface to the local batch system;
  • interface to the local mass storage system.
Agent distribution comes with examples of typical cases:
  • a “standard” site can be configured in a few minutes, e.g. PBS + disk mass storage.
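A site configuration script of this kind could look something like the sketch below for the PBS + disk storage case. The function names, the queue name and the storage location are assumptions for illustration only; the actual layout of the DIRAC configuration script is not shown in the slides. The only external command referenced is the standard PBS `qsub`, and the sketch builds the command line without running it.

```python
import os
import shutil
import tempfile

QUEUE = "long"                    # hypothetical PBS queue name
STORAGE_DIR = tempfile.mkdtemp()  # stands in for the site's disk storage area


def batch_submit_command(script_path, queue=QUEUE):
    """Interface to the local batch system: the PBS submission command line."""
    return ["qsub", "-q", queue, script_path]


def store_file(local_path, storage_dir=STORAGE_DIR):
    """Interface to the local mass storage: here a plain copy to disk."""
    target = os.path.join(storage_dir, os.path.basename(local_path))
    shutil.copy(local_path, target)
    return target


# Usage: build a submission command and store a dummy output file.
work_dir = tempfile.mkdtemp()
output = os.path.join(work_dir, "events.dat")
with open(output, "w") as handle:
    handle.write("dummy data")

command = batch_submit_command("run_job.sh")
stored = store_file(output)
print(command, stored)
```

A site with a different batch system or a tape-backed store would replace only these two functions.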
10
Dealing with failures
A job is rescheduled if a local system failure prevents it from running:
  • other sites can then pick it up.
Journaling:
  • all the sensitive files (logs, bookkeeping, job descriptions) are kept in the production site caches.
A job can be restarted from where it failed:
  • accomplished steps are not redone.
File transfers are automatically retried after a predefined pause in case of failures.
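Two of these mechanisms, retrying a transfer after a pause and skipping steps that already completed, can be sketched as follows. This is an assumption-laden illustration: the function names are invented, the journal is a plain in-memory set rather than files in a site cache, and the retry limit `max_attempts` is added here for safety and is not mentioned in the slides.

```python
import time


def transfer_with_retry(transfer, path, pause_seconds, max_attempts):
    """Call transfer(path), pausing and retrying on failure; return attempts used."""
    for attempt in range(1, max_attempts + 1):
        try:
            transfer(path)
            return attempt
        except OSError:
            if attempt == max_attempts:
                raise  # give up after the last allowed attempt
            time.sleep(pause_seconds)


def run_steps(steps, journal):
    """Run (name, action) steps in order, skipping those journaled as done."""
    executed = []
    for name, action in steps:
        if name in journal:
            continue  # accomplished steps are not redone
        action()
        journal.add(name)
        executed.append(name)
    return executed


# A transfer that fails twice before succeeding.
calls = {"n": 0}

def flaky_transfer(path):
    calls["n"] += 1
    if calls["n"] < 3:
        raise OSError("transfer failed")

attempts = transfer_with_retry(flaky_transfer, "output.dat",
                               pause_seconds=0, max_attempts=5)

# A restarted job whose first step already finished in an earlier run.
journal = {"Gauss"}
executed = run_steps([("Gauss", lambda: None), ("Brunel", lambda: None)], journal)
print(attempts, executed)
```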
11
Working experience
The DIRAC production system was deployed on 17 LHCb production sites:
  • 2 hours to 2 days of work for customization.
Smooth running for MC production tasks.
Much less burden for local production managers:
  • automatic data upload to CERN/Castor;
  • log files automatically available through a Web page;
  • automatic recoveries from common failures (job submission, data transfers).
The current Data Challenge production using DIRAC is advancing ahead of schedule:
  • ~1000 CPUs used in total;
  • 1M events produced per day.
12
DIRAC on the DataGRID
[Diagram: the DIRAC Production, Monitoring and Bookkeeping services on one side; on the DataGRID side the DataGRID portal, the Resource Broker receiving job.xml and JDL, worker nodes (WN), the Replica manager, the Replica catalog, the CERN SE and Castor.]
13
Deploying agents on the DataGRID
INPUT: the JDL InputSandbox contains:
  • job XML description;
  • agent launcher script:
    > wget 'http://…/distribution/dmsetup'
    > dmsetup --local DataGRID
    > shoot_agent job.xml
OUTPUT:
  • use EDG replica_manager for data transfer to CERN SE/Castor;
  • log files are passed back via the OutputSandbox.
14
Tests on the DataGRID testbed
Standard LHCb production jobs were used for the tests:
  • jobs of different statistics with an 8-step workflow.
Jobs submitted to 4 EDG testbed Resource Brokers:
  • keeping ~50 jobs per broker;
  • software installed for each job.

Job type (hours)   Total   Success   Success rate
Mini (0.2)           190       113            59%
Short (6)            171       102            59%
Medium (24)         1195       346            29%
Total               1556       561            36%

Total of ~300K events produced so far.
This makes the EDG testbed already a competitive LHCb production site.
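The totals and success rates in the table can be recomputed from the per-type counts with a few lines. The sketch below uses only the numbers quoted above; the slide gives whole-number percentages, so the last digit may differ slightly from the unrounded values depending on how it was rounded.

```python
# (total jobs, successful jobs) per job type, copied from the table.
results = {
    "Mini": (190, 113),
    "Short": (171, 102),
    "Medium": (1195, 346),
}

total_jobs = sum(total for total, _ in results.values())
total_ok = sum(ok for _, ok in results.values())

# Success rate in percent for each job type and overall.
rates = {name: 100.0 * ok / total for name, (total, ok) in results.items()}
overall = 100.0 * total_ok / total_jobs

for name, rate in rates.items():
    print(f"{name:<8}{rate:6.1f}%")
print(f"{'Total':<8}{overall:6.1f}%  ({total_ok} of {total_jobs} jobs)")
```

The overall rate is dominated by the Medium jobs, which make up most of the sample and have the lowest success rate.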
15
Main problems
EDG middleware instability problems:
  • MDS information system failures: “no matching resources found”;
  • RB fails to get input files because of gridftp failures;
  • jobs stuck in some unfinished state: “Done”, “Resubmitted”, etc.
Long jobs suffering from site misconfiguration:
  • RB fails to find appropriate resources;
  • jobs hit the limits of the local batch system;
  • “Estimated Traversal Time” failure as a ranking criterion.
Software installation failures:
  • disk quotas;
  • forbidden outbound IP connections on WNs at some sites.
16
Some lessons learnt
An API for software installation is needed:
  • for experiments to install software:
    • independently from site managers;
    • on a per-job basis if necessary;
  • for site managers to be sure the software is installed in an organized way.
Outbound IP connectivity should be available:
  • needed for the software installation;
  • needed for jobs exchanging messages with production services.
Uniform site descriptions:
  • EDG uniform CPU unit?
17
Conclusions
The DIRAC production system is now routinely running in production at ~17 sites;
The PULL paradigm for job scheduling proved to be very successful;
It is of great help for local production managers and a key to the success of the LHCb Data Challenge 2003;
The DataGRID testbed is integrated in the DIRAC production system; extensive tests are in progress.