WLCG-GDB Meeting, CERN, 12 May 2010. Patricia Méndez Lorenzo (CERN, IT-ES)
1
OPERATIONAL ISSUES FOR THE ALICE EXPERIMENT
WLCG-GDB Meeting. CERN, 12 May 2010
Patricia Méndez Lorenzo (CERN, IT-ES)
2
Outline
- 2010 Data Taking: results
- The new AliEn v2.18 version
- WLCG services news:
  - Failover mechanism for the VOBOXes
  - Deprecation of the LCG-CE
  - Raw data transfers and monitoring
- Operational procedures
- Summary and conclusions
3
2010 Data Taking: Results
- Cosmic-ray data taking from February until the end of March: ~10^5 events
- pp run since 30 March:
  - 7 TeV: 40x10^6 events
  - 0.9 TeV: 7x10^6 events
- Run processing starts immediately after the RAW data are transferred to the CERN MSS:
  - Average: ~5 h per job
  - After 10 h, 95% of the runs are processed
  - After 15 h, 99% of the runs are processed
- Raw data processing:
  - Pass 1-6 completed for the 0.9 and 2.36 TeV data
  - Pass1@T0 for the 7 TeV data follows the data taking
  - Analysis train running weekly: QA and physics-working-group organized analysis
- Raw data registration: ~77 TB
- MC production: several production cycles for 0.9, 2.36 and 7 TeV pp; 17x10^6 events with various generators and with conditions from the real data taking
4
Job profile and site distribution
Remarkable stability at all sites during the data taking
5
Transfers
- Peak: ~125 MB/s; average: ~30 MB/s; total transferred: 28.26 TB (only "good" runs are transferred)
- Full runs are transferred to each T1 site
- The SE choice is based on the MonALISA tests at transfer time; under equal conditions, the SE is taken randomly
- This will change to choosing the SE based on the amount of resources provided by the site, following the distribution already defined in SC3
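The two SE-choice policies just described (uniform random among equally good SEs today, weighted by the resources a site provides tomorrow) can be sketched as follows. This is a minimal illustration, not the actual AliEn transfer code; the SE names and resource figures are assumptions.

```python
# Sketch of the two SE-choice policies described above: uniform random choice
# among equally ranked SEs, versus a choice weighted by the resources each
# site provides. SE names and resource numbers are illustrative assumptions.
import random

def pick_se_uniform(ses, rng=random):
    """Current policy: under equal conditions, pick an SE at random."""
    return rng.choice(ses)

def pick_se_weighted(ses, resources, rng=random):
    """Planned policy: pick an SE with probability proportional to the
    amount of resources the site provides."""
    weights = [resources[se] for se in ses]
    return rng.choices(ses, weights=weights, k=1)[0]

ses = ["CCIN2P3", "CNAF", "FZK"]
resources = {"CCIN2P3": 400, "CNAF": 300, "FZK": 100}  # e.g. TB pledged (made up)
rng = random.Random(0)
print(pick_se_weighted(ses, resources, rng))
```

Over many transfers, the weighted policy sends roughly four times as many full runs to the hypothetical 400 TB site as to the 100 TB one, which is the distribution behaviour the slide alludes to.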
6
AliEn v2.18
- Many new features are included in AliEn v2.18, solving a good number of previous challenges
- The new version was deployed transparently from the central services, simultaneously with the startup of data taking
- Two important improvements deserve a mention:
  - Implementation of job and file quotas:
    - Limits on the available resources per user
    - Jobs: number of jobs, cpuCost, running time
    - Files: number of files, total size (including replicas)
  - Improved SE discovery:
    - Finds the closest working SEs of a given QoS once the file has been registered in the catalogue
    - Works for both reading and writing, taking the MonALISA tests into account
    - Simplifies the selection of the SE and gives more options in case of special needs
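The quota limits above (per-user counts of jobs, cpuCost and running time; per-user file counts and total size including replicas) could be enforced with checks like this minimal sketch. The field names and limit values are illustrative assumptions, not the actual AliEn v2.18 schema.

```python
# Minimal sketch of a per-user job/file quota check, as described above.
# All field names and limit values are illustrative assumptions, not the
# actual AliEn v2.18 implementation.
from dataclasses import dataclass

@dataclass
class UserQuota:
    max_jobs: int           # maximum number of jobs
    max_cpu_cost: float     # cumulative cpuCost limit
    max_running_time: float # cumulative running-time limit
    max_files: int          # maximum number of files
    max_total_size: int     # bytes, counting every replica

def can_submit_job(quota, n_jobs, cpu_cost, running_time):
    """Reject a new job if any job-related limit is already reached."""
    return (n_jobs < quota.max_jobs
            and cpu_cost < quota.max_cpu_cost
            and running_time < quota.max_running_time)

def can_register_file(quota, n_files, total_size, new_size, n_replicas):
    """A new file counts once per replica towards the total size."""
    return (n_files + 1 <= quota.max_files
            and total_size + new_size * n_replicas <= quota.max_total_size)

q = UserQuota(max_jobs=1000, max_cpu_cost=1e6, max_running_time=1e5,
              max_files=10000, max_total_size=2 * 1024**4)  # 2 TB, made up
print(can_submit_job(q, n_jobs=999, cpu_cost=5e5, running_time=4e4))  # True
print(can_submit_job(q, n_jobs=1000, cpu_cost=0, running_time=0))     # False
```

The point of the second function is the parenthetical on the slide: every replica counts against the size quota, so registering one file with two replicas consumes twice its size.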
7
How the SE discovery works
- Example for writing: the client asks "I am in Madrid, give me SEs" and receives an answer such as "Try: CCIN2P3, CNAF and Kosice"
- [Diagram: the answer is produced by the Authen service, which queries the File Catalogue and the SERank Optimizer; the optimizer is fed with the MonALISA test results]
- A similar process is followed for reading
- The number of SEs, the QoS, SEs to avoid, etc. can be selected
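The discovery answer can be pictured with a small sketch: filter the candidate SEs by the requested QoS, drop excluded or currently failing ones (failing according to MonALISA-style tests), and return the N best-ranked. This is a hypothetical ranking function, not the real SERank optimizer; all names and fields are assumptions.

```python
# Hypothetical sketch of the SE discovery described above: filter candidate
# storage elements by QoS and health (MonALISA-style test results), honour an
# avoid-list, and return the N closest by monitoring rank. All names and
# fields are illustrative assumptions, not the real SERank optimizer.
def discover_ses(catalogue, qos, n, avoid=(), for_writing=True):
    candidates = [
        se for se in catalogue
        if se["qos"] == qos
        and se["name"] not in avoid
        and (se["write_ok"] if for_writing else se["read_ok"])
    ]
    # Lower rank = "closer" / better according to the monitoring tests.
    candidates.sort(key=lambda se: se["rank"])
    return [se["name"] for se in candidates[:n]]

catalogue = [
    {"name": "CCIN2P3", "qos": "disk", "rank": 1, "write_ok": True,  "read_ok": True},
    {"name": "CNAF",    "qos": "disk", "rank": 2, "write_ok": True,  "read_ok": True},
    {"name": "Kosice",  "qos": "disk", "rank": 3, "write_ok": True,  "read_ok": True},
    {"name": "CERN",    "qos": "tape", "rank": 0, "write_ok": True,  "read_ok": True},
    {"name": "FZK",     "qos": "disk", "rank": 1, "write_ok": False, "read_ok": True},
]
print(discover_ses(catalogue, qos="disk", n=3))  # ['CCIN2P3', 'CNAF', 'Kosice']
```

The `n`, `qos` and `avoid` parameters correspond to the selectable options on the slide (number of SEs, QoS, SEs to avoid), and `for_writing` covers the read/write distinction.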
8
WLCG Services news: CREAM
- 2009 approach: CREAM-CE implementation in AliEn and distribution
  - System available at the T0, at all T1 sites (except NIKHEF at that time) and at several T2 sites
  - Dual submission (LCG-CE and CREAM-CE) at all sites providing CREAM
  - A second VOBOX was required at the sites providing CREAM to ensure the dual LCG-CE/CREAM-CE approach
- 2010 approach: deprecation of the gLite-WMS
- Latest news in terms of sites:
  - A third CREAM-CE at CERN on SL5 (ce203), announced on Monday night, entered production immediately
  - NIKHEF announced a local CREAM-CE yesterday afternoon; the system was successfully tested by ALICE and included in production
- ALICE is actively involved in the operation of the service at all sites, together with the site admins and the CREAM-CE developers
9
WLCG Services news: CREAM
- ALICE has set 31 May as the deadline for having a CREAM-CE at all sites; after that date, and depending on the status of the pending sites, those sites might be blacklisted
- Based on the current status, ALICE is running in CREAM mode at all sites
- The T0 is still running in dual mode; the deprecation of the LCG-CE there is not expected for the moment
10
Latest requirement: CREAM 1.6
- CREAM-CE 1.6 has been released in production for gLite 3.2/sl5_x86_64: https://savannah.cern.ch/patch/?3959
- The corresponding version for gLite 3.1/sl4_i386 has already been released in the staged rollout
- ALICE sites are encouraged to migrate to CREAM 1.6 as soon as possible:
  1. A large number of bugs reported by ALICE site admins have been solved in this version
  2. It will allow a lighter distribution of the current gLite 3.2 VOBOX
11
Reported ALICE bugs solved in CREAM 1.6
- Purge issues:
  - ALICE report: wrong reporting of job status; CREAM's view of the running jobs gets de-synchronized
  - The CREAM job status can be wrongly reported because of misconfigurations or because of two bugs in the BLAH BLParser:
    - #55078: possible final state not considered in BLParserPBS and BUpdaterPBS (ready for review)
    - #54949: some jobs can remain in running state when the BLParser is restarted, for both LSF and PBS (ready for review)
  - #55420: allow the admin to purge CREAM jobs in a non-terminal status (verified)
- Disk space issues:
  - ALICE report: issues with the cleanup of the /opt/glite/var/cream/user_proxy area
  - #49497: user proxies on CREAM do not get cleaned up (ready for review)
- Load issues:
  - ALICE report: when tomcat is restarted, the system can take up to 15 min before submitting new jobs
  - The slow start of CREAM is also due to the problems coming from jobs reported in a wrong status
  - #51978: CREAM can be slow to start (verified)
12
Reported ALICE bugs solved in CREAM 1.6 (cont.)
- Load issues (cont.):
  - ALICE report: growth of the UNIX load; the load increases during automatic purge operations, and is also visible during high job-submission rates
  - #58103: CREAM database query performance (ready for review). The GRNET "CREAM performance report" shows that very heavy queries are performed during purge operations
- Other issues:
  - ALICE report: the BLParser is not automatically restarted at boot time (only tomcat is); it has to be restarted by hand in order to recover the queue information
  - #56518: BLAH BLParser doesn't start after a boot of the machine (verified)
  - ALICE report: wrong SGM mapping; job submission fails when the JDL contains an InputSandbox. The origin of the problem is a wrong user mapping between CREAM and gridftp
  - #58941: lcmaps configurations for glexec and gridftp are not fully synchronized (verified)
13
The future VOBOX
- Direct submission of jobs via the CREAM-CE requires the specification of a gridftp server to save the output sandbox (OSB)
  - The server is specified at the level of the JDL file
  - ALICE solved this by requiring a gridftp server at the local VOBOX (distributed with the gLite 3.2 VOBOX)
- The OSB cannot be retrieved from the CREAM disk via any client command
  - Well, not fully true: the functionality is possible but not exposed
  - The lack of a space management mechanism discourages such a procedure
- Requirements to expose this feature:
  - Automatic purge procedures
  - Limiters blocking new submissions in case of low free disk space
14
The future VOBOX
- Automatic purge procedures:
  - Already included in CREAM 1.5, following a configurable policy
  - The sandbox area of a job is deleted when the job is purged
  - http://grid.pd.infn.it/cream/field.php?n=Main.HowToPurgeJobsFromTheCREAMDB
- Limiters:
  - New feature included in CREAM 1.6
- Several users asked for the possibility to save the OSB on the CREAM-CE
  - CREAM 1.6 exposes for the first time the possibility to leave the OSB on the CREAM-CE
  - If the OSB can be left on the CREAM-CE, a gridftp server at the VOBOX is no longer needed
  - The feature was successfully tested by ALICE in Torino (CREAM 1.5), trusting the available purge procedure
  - The implementation in AliEn is very simple but not backward compatible
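The two mechanisms listed above, an age-based purge of sandbox areas and a limiter that blocks new submissions when free disk space runs low, can be sketched as follows. The paths, age threshold and free-space fraction are purely illustrative assumptions, not CREAM's actual configuration.

```python
# Illustrative sketch of the two mechanisms listed above: an age-based
# sandbox purge and a free-disk-space limiter. Thresholds and directory
# layout are assumptions, not CREAM's real configuration.
import os
import shutil
import time

def purge_old_sandboxes(sandbox_root, max_age_days):
    """Delete per-job sandbox directories older than the configured age
    and return the names of the removed directories."""
    cutoff = time.time() - max_age_days * 86400
    removed = []
    for entry in os.scandir(sandbox_root):
        if entry.is_dir() and entry.stat().st_mtime < cutoff:
            shutil.rmtree(entry.path)
            removed.append(entry.name)
    return removed

def accept_new_submission(path, min_free_fraction=0.10):
    """Limiter: refuse new jobs when the free disk space on the sandbox
    filesystem drops below the configured fraction."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction
```

Together these make leaving the OSB on the CE safe in the sense the slide describes: the purge bounds how long sandboxes accumulate, and the limiter protects the CE when they accumulate faster than the purge removes them.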
15
WLCG services news: VOBOX
- For the 2009 approach, ALICE required a second VOBOX at those sites providing both submission backends (LCG-CE and CREAM-CE); the motivation was explained extensively during previous GDB meetings
- The 2010 approach foresees a single backend, the CREAM-CE, so in principle a single VOBOX is needed
- What to do with the second VOBOX? Rescue it: FAILOVER MECHANISM
  - This approach has been included in AliEn v2.18 to take advantage of the second VOBOX deployed at several ALICE sites
  - ~25 sites currently provide two or more VOBOXes
16
VOBOX: The failover mechanism. The Setup
- Same configuration for both local VOBOXes: neither has precedence over the other; they run exactly the same services and share the same software area
- AliEn v2.18 implementation, a simple approach:
  - All services (except MonALISA) try to connect, one after the other, to the hosts of a list kept in LDAP
  - The list contains the names of the local VOBOXes
  - The connection is established with the first available VOBOX, which takes the whole load in case of failures
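The mechanism above amounts to iterating over the LDAP-ordered host list and settling on the first VOBOX that answers. A minimal sketch, where `connect` is a stand-in for the real AliEn service handshake and the host names are made up:

```python
# Minimal sketch of the failover logic described above: try the VOBOX hosts
# in the order given by the LDAP list and keep the first one that responds.
# `connect` is a stand-in for the real AliEn service connection; the host
# names below are made up.
def connect_with_failover(hosts, connect):
    """Return (host, connection) for the first reachable host in the list."""
    for host in hosts:
        try:
            return host, connect(host)
        except ConnectionError:
            continue  # this VOBOX is down, fall through to the next one
    raise ConnectionError("no VOBOX in the list is reachable")

# Toy connection function: pretend the first VOBOX is down.
def fake_connect(host):
    if host == "vobox1.example.org":
        raise ConnectionError(host)
    return f"session-{host}"

host, session = connect_with_failover(
    ["vobox1.example.org", "vobox2.example.org"], fake_connect)
print(host)  # vobox2.example.org
```

Because both VOBOXes run identical services and share the software area, whichever host answers first can absorb the whole load, which is exactly the setup condition stated on the slide.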
17
WLCG Services news: raw data transfer monitoring
- Dashboard monitoring is already available
18
Operational Procedures
- Issues are reported daily at the operations meeting
- The weekly ALICE TF meeting now includes analysis items (TF & AF meeting); it was moved to 16:30 to be able to contact the American sites
- Latest issues at T0:
  - CAF nodes: instabilities in some nodes have been observed in the last weeks; thanks to the IT experts for the prompt answers and actions
  - AFS space: replication of the AFS ALICE volumes and their separation into readable and writable volumes; thanks to Harry and Rainer for their help
19
Summary and conclusions
- Very smooth operation of all sites and services during the 2010 data taking; very good response from site admins and experts in case of problems
- The new AliEn v2.18 has been deployed and will be responsible for the ALICE data-taking infrastructure in the coming months; the new version was deployed transparently, in parallel with the startup of the data taking
- In terms of services, sites are encouraged to provide the latest CREAM 1.6 version as soon as possible