Patricia Méndez Lorenzo (IT/GS) ALICE Offline Week (18th March 2009)

Introduction
- ALICE is interested in the deployment of the CREAM-CE service at all sites that support the experiment
  - GOAL: deprecate the use of the WMS in favour of direct CREAM-CE submission
  - The WMS submission mode to CREAM-CE is not required
- ALICE has been testing the CREAM-CE in the real production environment since the beginning of summer 2008
- For the time being, ALICE is the only LHC experiment performing stress and real tests of the CREAM-CE
- This talk focuses on the ALICE experience with the CREAM-CE, the expectations, future plans and requirements for all the sites
18/03/09 ALICE Offline Week -- CREAM-CE Use and Status for ALICE

The CREAM-CE
- CREAM (Computing Resource Execution And Management): a lightweight service for job management operations at the CE level
- Intended as the replacement of the current LCG-CE
- Submission procedures allowed by CREAM:
  - Submission to CREAM via the WMS
  - Direct submission via generic clients
- The submission method depends basically on the experiment computing model
  - Pilot-based models normally follow the direct submission approach (4 LHC experiments)
  - Bulk submissions of real jobs follow the WMS submission approach (CMS)

Direct Submission to CREAM-CE
- Extra elements required for direct submission
  - Proxy renewal mechanism (required by CMS and ATLAS)
    - Responsible for automatically renewing the user proxy when it is about to expire
    - Already (recently) available
- The lack of this element is not a showstopper for ALICE
  - 48h VOMS extensions are ensured by the security team at CERN
  - Enough to run production/analysis jobs without any additional extension

The 1st test phase
- Performed in summer 2008 at FZK (T1 site, Germany)
- Tests operated through a second VOBOX, parallel to the already existing service at the T1 (operating in WMS submission mode)
- Access to the local CREAM-CE was ensured through the PPS infrastructure
  - Initially 30 CPUs
  - Moved to the ALICE production queue within a few weeks (production setup)
- Intensive functionality and stability tests from July to September 2008
  - Production stopped to create an ALICE CREAM module in AliEn and to allow the site to upgrade its system
- Excellent support from the CREAM-CE developers and the site admins
  - Especially Massimo Sgaravatto (INFN-Padova) and Angela Poschlad (GridKa T1 site)

Results of the 1st test phase
- More than 55000 jobs successfully executed through the CREAM-CE in the mentioned period
- No interventions in the VOBOX required during the testing phase
- CREAM-CE used to distribute real (standard) ALICE jobs
[Figure labels: "Running on the production queue"; "Running on PPS nodes"]

Implementation into AliEn (I)
- Creation of a new CREAM module
  - Specific to CREAM-CE submissions
  - Available since AliEn v2-16
  - Runs in parallel with the usual LCG module (restricted to WMS submissions only)
- Change in the JDL construction
  - The current ALICE JDL contains the outputsandbox field, which specifies the standard outputs of the job agents
  - CREAM-CE requires a new JDL field that declares the GridFTP server from which to retrieve the standard outputs
  - ALICE PROCEDURE: remove the outputsandbox field from the JDL files created by the CREAM module
    - Only available in case of submission in debug mode
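
The JDL change on this slide could be sketched roughly as follows. This is an illustration in Python, not the AliEn CREAM module itself (which the talk does not show). The function name build_agent_jdl, the file names and the example GridFTP URI are hypothetical, and OutputSandboxBaseDestURI is assumed to be the CREAM JDL attribute that declares the GridFTP destination.

```python
def build_agent_jdl(executable, debug=False, gridftp_uri=None):
    """Build a JDL string for a job agent (hypothetical helper).

    Outside debug mode the output sandbox is left out entirely, so
    no GridFTP destination is needed. In debug mode the standard
    outputs are declared together with the GridFTP server that will
    receive them.
    """
    fields = {
        "Executable": f'"{executable}"',
        "StdOutput": '"std.out"',
        "StdError": '"std.err"',
    }
    if debug:
        if gridftp_uri is None:
            raise ValueError("debug mode needs a GridFTP destination")
        # Assumed attribute names; check them against the CREAM JDL guide.
        fields["OutputSandbox"] = '{"std.out", "std.err"}'
        fields["OutputSandboxBaseDestURI"] = f'"{gridftp_uri}"'
    body = "\n".join(f"  {key} = {value};" for key, value in fields.items())
    return "[\n" + body + "\n]"


if __name__ == "__main__":
    print(build_agent_jdl("agent.sh"))
    print(build_agent_jdl("agent.sh", debug=True,
                          gridftp_uri="gsiftp://vobox.example.org/tmp/agents"))
```

Keeping the sandbox fields behind a single debug switch means a normal production submission never depends on the GridFTP server being reachable.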

Implementation into AliEn (II)
- A GridFTP server is required
  - Needed to retrieve the standard outputs of the job agents
  - Sites are free to decide its implementation (proposal: VOBOX)
  - 200 GB of space required
  - It will be used ONLY if the submission has been done in debug mode
- Change in the proxy renewal mechanism
  - Purpose: submission optimization
  - The user proxy will be renewed only once per hour
    - In previous AliEn versions this procedure was executed BEFORE each agent submission
  - The procedure has ALSO been implemented in LCG.pm
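
The once-per-hour renewal can be pictured as a simple throttle placed in front of the renewal call. The sketch below is in Python and only illustrates the idea. The class name ProxyRenewer and the renew callback are hypothetical, and the real AliEn code (LCG.pm and the CREAM module) may be organised differently.

```python
import time


class ProxyRenewer:
    """Call `renew` at most once per `min_interval_s` seconds."""

    def __init__(self, renew, min_interval_s=3600.0, clock=time.monotonic):
        self._renew = renew                # callable that renews the proxy
        self._min_interval_s = min_interval_s
        self._clock = clock                # injectable for testing
        self._last = None                  # time of the last renewal

    def maybe_renew(self):
        """Renew if the interval has elapsed; return True if renewed."""
        now = self._clock()
        if self._last is not None and now - self._last < self._min_interval_s:
            return False                   # renewed recently, skip
        self._renew()
        self._last = now
        return True


if __name__ == "__main__":
    renewer = ProxyRenewer(lambda: print("renewing proxy"))
    for _ in range(3):                     # three agent submissions in a row
        renewer.maybe_renew()              # only the first one renews
```

Calling maybe_renew() before every agent submission keeps the submission loop unchanged while removing the repeated renewals.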

The 2nd test phase
- After a debug phase of the CREAM module in January 2009, the new CREAM module entered production on the 19th of February (start of the 2nd testing phase)
- Stability and performance are currently the most important test issues at the sites providing CREAM-CE
- The deployment of a 2nd VOBOX ensures that production continues in parallel through the WMS
  - A single VOBOX would require dedicated babysitting of the system (not realistic)
- Feedback on all issues is provided directly to the CREAM developers
- As of today, 11 sites are providing a CREAM-CE

Site queues

Site      CREAM-CEs            CREAM Status  2nd VOBOX       Clients in VOBOX  General Status
FZK       1 (4 queues)         OK            YES             YES               OK
Kolkata   2                    OK            YES             YES               OK
Athens    1                    OK            NO              NO                NOT OK
KISTI     1                    OK            YES             YES               OK
GSI       1                    OK            NO              YES               NOT OK*
IHEP
RAL       1                    OK            NO              YES               OK*
CNAF      1                    OK            YES             YES               OK
CERN      2 (3 queues each)    OK            YES             YES               OK
Torino    1                    OK            YES             YES               OK
SARA      1                    OK            In preparation  YES               In testing

Status of the sites (I)
- FZK
  - Minor actions required during the 2nd test phase
    - Deleting some sandbox directories (hitting the 32K-subdirectory limit again)
    - Procedure not necessary in the next CREAM versions
  - 46530 jobs since the 19th of February through the FZK CREAM-CE
- RAL
  - No special actions reported by the site for service maintenance
  - 2678 jobs executed using the local CREAM-CE
- Kolkata
  - Debugging phase performed directly with the developer (Massimo Sgaravatto)
  - In production since the 9th of March

Status of the sites (II)
- CERN
  - Two CEs were provided to ALICE for testing on the 9th of March
  - In production since the 10th of March (voalice03 used for this production)
  - SLC5 WNs behind the CREAM-CE
  - 17247 jobs since the 10th of March
- GSI
  - The setup of a 2nd VOBOX is still pending
  - The CREAM-CE is performing well
- CNAF
  - CREAM-CE ready to enter production at the end of February
  - After some instabilities observed last week (lack of automatic purge), it went back into production on the 13th of March
  - The info provider of the CREAM-CE is showing certain instabilities

Status of the sites (III)
- KISTI
  - Instabilities at the VOBOX level prevent the full setup of the local CREAM-CE in production
  - CREAM-CE system performing well
- Athens
  - The CREAM-CE is working but the site cannot be put in production
    - No CREAM clients on the VOBOX
- IHEP
  - CREAM-CE is not working yet (site admin working on it)
  - Missing infrastructure: no 2nd VOBOX (it will be provided next week)

Status of the sites (IV)
- SARA
  - System tested yesterday evening with a few jobs
  - Still in testing phase
- Torino
  - System in production since last week
  - Already 744 jobs executed through the local CREAM system
- Subatech
  - 2nd VOBOX already provided; the setup of the CREAM-CE is ongoing

Reminder: How to provide CREAM-CE services for ALICE
- During last October's pre-GDB meeting it was explicitly mentioned:
  - Unlikely to be deployable as an lcg-CE replacement on this timescale (downtime period), but we can continue with rollout in parallel.
- In addition, during the November pre-GDB meeting it was concluded:
  - The lcg-CE replacement will require the WMS submission to be in place and the resolution of the proxy renewal issue (among other points related to the service performance)
- However, the deployment of the system in parallel to the LCG-CE was encouraged

Reminder: How to provide CREAM-CE services for ALICE (II)
- In terms of the ALICE computing model, the parallel LCG-CE vs. CREAM-CE setup means the deployment of a 2nd VOBOX
  - Each VOBOX is able to submit to one specific backend
  - One VOBOX: LCG-CE OR CREAM-CE submission (replacement approach)
  - Two VOBOXes: LCG-CE AND CREAM-CE submission (parallel approach)
- This is a temporary solution during the parallel running phase
  - As soon as the replacement is ensured and the LCG-CE is deprecated, ALICE will not require a 2nd VOBOX
- Remarks for the 2nd VOBOX deployment
  - Its setup is not signed in blood
  - Each case can be studied individually
  - BUT! Sites with important storage capacity for ALICE should be included in the list of sites providing a 2nd VOBOX

Reminder: How to provide CREAM-CE services for ALICE (III)
- Setup of the ALICE production queue behind the CREAM-CE
  - This procedure puts the CREAM-CE directly in production
- GridFTP server
  - Required to retrieve the job (agent) outputs
  - Removed from the VOBOX in January 2008 with the deployment of the gLite 3.1 VOBOX
    - It was no longer required by the 4 LHC experiments at that time
  - No specific wish for the placement of this service
    - It can be provided in the VOBOX, but this is a site decision

Future Plans
- Small changes in the CREAM module are still needed
  - The current implementation of CREAM-CE submission via the CLI allows the declaration of a single queue only
  - Sites can provide several queues per site (especially T0/T1 sites)
  - Submission to several queues must be implemented at the application level
- PROPOSAL for ALICE (in 3 lines of code):
  - Definition of a range per queue at the LDAP level
  - Calculation of a random number before each agent submission
  - Assignment of a queue based on the random number/range matchmaking
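
The three steps of the proposal could look roughly like the Python sketch below. It is only an illustration under stated assumptions: the queue names and ranges are hypothetical and hard-coded here, whereas the proposal would read them from LDAP, and pick_queue is an invented helper name.

```python
import random


def pick_queue(ranges, rng=random):
    """Pick a queue by matching a random number against per-queue ranges.

    `ranges` is a list of (queue_name, low, high) tuples. Each range is
    half-open, [low, high), and together they are expected to cover
    [0, 1) without overlap.
    """
    x = rng.random()                       # random number in [0, 1)
    for name, low, high in ranges:
        if low <= x < high:                # number/range matchmaking
            return name
    raise ValueError(f"no queue range covers {x}")


if __name__ == "__main__":
    # Hypothetical site with two queues sharing the load 70/30.
    site_ranges = [
        ("alice_long", 0.0, 0.7),
        ("alice_short", 0.7, 1.0),
    ]
    # One draw before each agent submission.
    for _ in range(5):
        print(pick_queue(site_ranges))
```

The width of each range sets the share of agents sent to that queue, so a site can rebalance its queues by editing the ranges without any change to the submission code.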

Conclusions
- The ALICE experience with the current CREAM-CE service is very positive
  - Stable (and maintenance-free) operation is achieved quickly after the initial debugging period
  - High performance and scalability (FZK: 2000+ parallel jobs served by a single CREAM-CE)
  - Excellent support provided by the developers
    - Special thanks to Massimo Sgaravatto (INFN Padova)
- ALICE is working with all sites to install a CREAM-CE
  - In full production before the start of data taking