Transcript of "Operational Issues for the ALICE Experiment", WLCG-GDB Meeting, CERN, 12 May 2010.

Page 1:

OPERATIONAL ISSUES FOR THE ALICE EXPERIMENT

WLCG-GDB Meeting. CERN, 12 May 2010

Patricia Méndez Lorenzo (CERN, IT-ES)

Page 2: Outlook

• 2010 Data Taking: Results
• The new AliEn v2.18 version
• WLCG services news
  • Failover mechanism for the VOBOXES
  • Deprecation of the LCG-CE
  • Raw data transfers and monitoring
• Operational procedures
• Summary and Conclusions

Page 3: 2010 Data Taking: Results

• Cosmic-ray data taking from February until the end of March: ~10^5 events

• pp run since March 30th
  • 7 TeV: 40×10^6 events
  • 0.9 TeV: 7×10^6 events

• Run processing starts immediately after the RAW data are transferred to the CERN MSS
  • Average: ~5 h per job
  • At 10 h, 95% of the runs are processed
  • At 15 h, 99% of the runs are processed

• Raw data processing
  • Pass 1-6 completed for the 0.9 and 2.36 TeV data
  • Pass1@T0 for the 7 TeV data follows data taking
  • Analysis train running weekly: QA and analyses organized by the physics working groups
• Raw data registration: ~77 TB

• MC production: several production cycles for 0.9, 2.36 and 7 TeV pp; 17×10^6 events with various generators and with conditions from real data taking

[Plots: cumulative raw data registration with the LHC restart marked; run-processing latency with 5 h / 10 h / 15 h marks]

Page 4: Job profile and site distribution

Remarkable stability at all sites during the data taking

Page 5: Transfers

• Peak: ~125 MB/s; average: ~30 MB/s; total transferred: 28.26 TB (only “good” runs are being transferred)
• Full runs transferred to each T1 site
• SE choice based on the ML (MonaLisa) tests at transfer time
  • Under equal conditions, the SE is taken randomly
  • This will change to choosing the SE based on the amount of resources provided by the site (see the sketch below)
  • Distribution already defined in SC3
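The slide describes the planned change from a uniform random choice to a resource-weighted one. A minimal sketch of that idea, assuming hypothetical SE names and resource shares (AliEn's actual implementation may differ):

```python
import random

# Share of ALICE resources provided by each T1 SE (hypothetical numbers).
SE_RESOURCES = {
    "ALICE::CCIN2P3::TAPE": 0.30,
    "ALICE::CNAF::TAPE": 0.25,
    "ALICE::FZK::TAPE": 0.25,
    "ALICE::RAL::TAPE": 0.20,
}

def pick_destination(passing_ml_tests):
    """Pick one SE among those currently passing the MonaLisa tests,
    weighted by the amount of resources the site provides."""
    candidates = {se: w for se, w in SE_RESOURCES.items() if se in passing_ml_tests}
    if not candidates:
        raise RuntimeError("no healthy SE available for this transfer")
    ses, weights = zip(*candidates.items())
    return random.choices(ses, weights=weights, k=1)[0]

# Example: FZK fails its ML tests at transfer time, so it is excluded.
print(pick_destination({"ALICE::CCIN2P3::TAPE", "ALICE::CNAF::TAPE", "ALICE::RAL::TAPE"}))
```

With equal weights this reduces to the current behaviour: a uniform random pick among the SEs passing the ML tests.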

Page 6: AliEn v2.18

Many new features are included in AliEn v2.18, solving quite a lot of previous challenges. The deployment of this version was done transparently from the central services, simultaneously with the startup of data taking. We mention here two important improvements:

• Implementation of Job and File Quotas: a limit on the available resources per user (a sketch of such a check follows)
  • Jobs: number of jobs, cpuCost, running time
  • Files: number of files, total size (including replicas)
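As an illustration of the quota dimensions listed above, a sketch of a per-user check; all field and function names are hypothetical, not AliEn's actual schema:

```python
from dataclasses import dataclass

@dataclass
class UserQuota:
    # Illustrative field names mirroring the quantities on the slide.
    max_jobs: int          # unfinished jobs allowed
    max_cpu_cost: float    # accumulated cpuCost allowed
    max_running_time: int  # accumulated running time allowed (seconds)
    max_files: int         # catalogue entries allowed
    max_size: int          # total size allowed in bytes, replicas included

def may_submit(q: UserQuota, jobs: int, cpu_cost: float, running_time: int) -> bool:
    """Refuse new jobs once any job-quota dimension is exhausted."""
    return jobs < q.max_jobs and cpu_cost < q.max_cpu_cost and running_time < q.max_running_time

def may_register(q: UserQuota, files: int, used_size: int, new_size: int, replicas: int) -> bool:
    """Refuse a new file if it would exceed the file quota; every replica
    counts towards the total size."""
    return files < q.max_files and used_size + new_size * replicas <= q.max_size
```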

• Improved SE discovery
  • Finds the closest working SEs of a given QoS once the file has been registered in the catalogue
  • Works for both reading and writing, taking the ML tests into account
  • Simplifies the selection of an SE, while giving more options in case of special needs

Page 7: How the SE discovery works

Example, for writing: the client (here, sitting in Madrid) asks “I am in Madrid, give me SEs” and receives the answer “Try: CCIN2P3, CNAF and Kosice”. The components involved are the Authen service, the File Catalogue and the SERank Optimizer, which ranks the SEs using MonaLisa data.

A similar process is followed for reading. The number of SEs, the QoS, SEs to avoid… can all be selected (see the sketch below).
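A minimal sketch of the discovery call described above, with a hard-coded ranking table standing in for the SERank Optimizer and MonaLisa (all names are illustrative):

```python
# Hypothetical ranking table, kept up to date by the SERank Optimizer from
# MonaLisa test results: one (SE, QoS, rank) entry per SE, lower rank = better.
SE_RANK = {
    "Madrid": [
        ("ALICE::CCIN2P3::SE", "disk", 1),
        ("ALICE::CNAF::SE", "disk", 2),
        ("ALICE::Kosice::SE", "disk", 3),
    ],
}

def discover_ses(site, count=3, qos="disk", avoid=()):
    """Return up to `count` working SEs of the requested QoS for a client at
    `site`, best-ranked first, skipping any SE the caller asked to avoid."""
    ranked = sorted(SE_RANK.get(site, []), key=lambda entry: entry[2])
    return [se for se, se_qos, _ in ranked if se_qos == qos and se not in avoid][:count]

# "I am in Madrid, give me SEs" -> Try: CCIN2P3, CNAF and Kosice
print(discover_ses("Madrid"))
```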

Page 8: WLCG Services news: CREAM

• 2009 approach: CREAM-CE implementation in AliEn and distribution
  • System available at T0, at all T1 sites (except NIKHEF at that time) and at several T2 sites
  • Dual submission (LCG-CE and CREAM-CE) at all sites providing CREAM
  • A second VOBOX was required at the sites providing CREAM, to ensure the dual LCG-CE vs. CREAM-CE approach
• 2010 approach: deprecation of the gLite-WMS
• Latest news in terms of sites:
  • A 3rd CREAM-CE at CERN on SL5 (ce203), announced on Monday night, entered production immediately
  • NIKHEF announced a local CREAM-CE yesterday afternoon; the system was successfully tested by ALICE and included in production
• ALICE is actively involved in the operation of the service at all sites, together with the site admins and the CREAM-CE developers

Page 9: WLCG Services news: CREAM

• ALICE has established the 31st of May as the deadline for having a CREAM-CE at all sites
  • After that date, and based on the status of the pending sites, those sites might be blacklisted
• Based on the current status, we can say that ALICE is running in CREAM mode at all sites
• T0 is still running in dual mode, and the deprecation of the LCG-CE there is not expected for the moment

Page 10: Latest requirement: CREAM 1.6

• CREAM-CE 1.6 has been released in production for gLite 3.2/sl5_x86_64: https://savannah.cern.ch/patch/?3959
• The corresponding version for gLite 3.1/sl4_i386 has already been released in the staged rollout
• ALICE sites are encouraged to migrate to CREAM 1.6 as soon as possible:
  1. A large number of bugs reported by ALICE site admins have been solved in this version
  2. It will allow a lighter distribution of the current gLite 3.2 VOBOX

Page 11: Reported ALICE bugs solved in CREAM 1.6

• Purge issues
  • ALICE report: wrong reporting of the job status; CREAM’s vision of the running jobs gets de-synchronized. The CREAM job status can be wrongly reported because of some misconfigurations or because of these two bugs in the BLAH BLparser:
    • #55078: Possible final state not considered in BLParserPBS and BUpdaterPBS (ready for review)
    • #54949: Some job can remain in running state when BLParser is restarted for both lsf and pbs (ready for review)
  • #55420: Allow admin to purge CREAM jobs in a non terminal status (verified)
• Disk space issues
  • ALICE report: issues regarding the cleanup of the /opt/glite/var/cream/user_proxy area
  • #49497: user proxies on CREAM do not get cleaned up (ready for review)
• Load issues
  • ALICE report: when Tomcat is restarted, the system can take up to 15 min before submitting new jobs. The slow start of CREAM is also due to the problems coming from jobs reported in a wrong status
  • #51978: CREAM can be slow to start (verified)

Page 12: Reported ALICE bugs solved in CREAM 1.6

• Load issues (cont.)
  • ALICE report: growth of the UNIX load. The load increases during automatic purge operations and is also visible during high job-submission rates
  • #58103: Cream database query performance (ready for review). The GRNET “CREAM performance report”: very heavy queries are performed during purge operations
• Other issues
  • ALICE report: the BLparser is not automatically restarted at boot time (only Tomcat is); the BLparser has to be restarted by hand in order to recover the queue info
  • #56518: BLAH blparser doesn't start after boot of the machine (verified)
  • ALICE report: wrong SGM mapping; job submission fails when the JDL contains an InputSandbox. The origin of the problem is a wrong user mapping between CREAM and gridftp
  • #58941: lcmaps confs for glexec and gridftp are not fully synchronized (TM) (verified)

Page 13: The future VOBOX

• Direct submission of jobs via the CREAM-CE requires the specification of a gridftp server where the OSB (output sandbox) is saved
  • The server is specified at the level of the JDL file (see the sketch after this list)
  • ALICE solved this by requiring a gridftp server at the local VOBOX (distributed with the gLite 3.2 VOBOX)
• The OSB cannot be retrieved from the CREAM disk via any client command
  • Well… not fully true: the functionality is possible but not exposed
  • The lack of a space-management mechanism discourages such a procedure
• Requirements for exposing this feature:
  • Automatic purge procedures
  • Limiters blocking new submissions in case of low free disk space
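To make the JDL-level specification concrete, a sketch that builds such a JDL and submits it directly to a CREAM-CE. The hostnames, paths and script name are placeholders; OutputSandboxBaseDestURI and glite-ce-job-submit are the standard CREAM JDL attribute and client command:

```python
import subprocess
import textwrap

# Hypothetical hostnames and paths, for illustration only.
VOBOX_GRIDFTP = "gsiftp://voalice.example.org/data/osb"
CREAM_ENDPOINT = "ce203.cern.ch:8443/cream-pbs-alice"

def write_jdl(path="agent.jdl"):
    """Write a minimal CREAM JDL that sends the OSB to the VOBOX gridftp
    server through the OutputSandboxBaseDestURI attribute."""
    jdl = textwrap.dedent(f"""\
        [
        Executable = "agent.sh";
        OutputSandbox = {{"std.out", "std.err"}};
        OutputSandboxBaseDestURI = "{VOBOX_GRIDFTP}";
        ]
    """)
    with open(path, "w") as handle:
        handle.write(jdl)
    return path

# Direct submission to the CREAM-CE with the standard client
# (-a: automatic proxy delegation, -r: resource endpoint).
subprocess.run(["glite-ce-job-submit", "-a", "-r", CREAM_ENDPOINT, write_jdl()], check=True)
```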

Page 14: The future VOBOX

• Automatic purge procedures
  • Already included in CREAM 1.5, following a configurable policy
  • The sandbox area of a job is deleted when the job is purged
  • http://grid.pd.infn.it/cream/field.php?n=Main.HowToPurgeJobsFromTheCREAMDB
• Limiters: a new feature included in CREAM 1.6 (see the sketch after this list)
• Several users have asked for the possibility to save the OSB on the CREAM-CE
  • CREAM 1.6 exposes for the 1st time the possibility to leave the OSB on the CREAM-CE
  • If the OSB can be left on the CREAM-CE, a gridftp server at the VOBOX is no longer needed
  • Feature successfully tested by ALICE in Torino (CREAM 1.5), trusting the available purge procedure
  • The implementation in AliEn is very simple but not backward compatible
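A minimal sketch of the limiter idea, assuming a POSIX check on the sandbox partition; the path and threshold are assumptions, and CREAM's real limiter is configured inside the service rather than implemented like this:

```python
import shutil

# Typical CREAM sandbox location; the path and threshold are assumptions.
SANDBOX_DIR = "/opt/glite/var/cream_sandbox"
MIN_FREE_BYTES = 10 * 1024**3  # e.g. require 10 GB free

def accept_new_submission() -> bool:
    """Refuse new job submissions when the free space on the sandbox
    partition drops below the threshold, mimicking CREAM 1.6's limiter."""
    return shutil.disk_usage(SANDBOX_DIR).free >= MIN_FREE_BYTES
```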

Page 15: WLCG services news: VOBOX

• For the 2009 approach, ALICE required a 2nd VOBOX at those sites providing both submission backends (LCG-CE and CREAM-CE)
  • The motivation was extensively explained during previous GDB meetings
• The 2010 approach foresees a single backend: CREAM-CE
  • In principle a single VOBOX is needed. What to do with the 2nd VOBOX?
  • Rescue it: FAILOVER MECHANISM
• This approach has been included in AliEn v2.18, to take advantage of the 2nd VOBOX already deployed at several ALICE sites
  • ~25 sites currently provide >=2 VOBOXES

Page 16: VOBOX: The failover mechanism. The Setup

• Same configuration for both local VOBOXES
  • Neither has a privileged role: they run exactly the same services and share the same software area
• AliEn v2.18 implementation: a simple approach (see the sketch below)
  • All services (except MonaLisa) try to connect, one after another, to the 1st available host from a list included in LDAP
  • The list contains the names of the local VOBOXES
  • The connection is established with the first available VOBOX, which will take the whole load in case of failures
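A minimal sketch of this failover logic, assuming hypothetical host names and service port:

```python
import socket

def first_available(vobox_hosts, port=8084, timeout=5.0):
    """Connect to the first VOBOX in the LDAP-provided list that answers;
    the remaining hosts act as failover targets. (The port number is a
    placeholder, not AliEn's actual service port.)"""
    for host in vobox_hosts:
        try:
            return socket.create_connection((host, port), timeout=timeout)
        except OSError:
            continue  # host down or unreachable: try the next VOBOX
    raise RuntimeError("no VOBOX reachable at this site")

# Hypothetical host names as they might appear in the site's LDAP entry.
conn = first_available(["voalice01.example.org", "voalice02.example.org"])
```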

Page 17: WLCG Services news: raw transfers monitoring

Dashboard monitoring of the raw data transfers is already available.

Page 18: Operations Procedures

• Issues are reported daily at the operations meeting
• The weekly ALICE TF meeting now includes analysis items (TF & AF meeting)
  • Moved to 16:30 to ease contact with the American sites
• Latest issues at T0:
  • CAF nodes: instabilities in some nodes have been observed in the last weeks. Thanks to the experts in IT for the prompt answers and actions
  • AFS space: replication of the AFS ALICE volumes and separation into readable and writable volumes. Thanks to Harry and Rainer for their help

Page 19: Summary and conclusions

• Very smooth operation of all sites and services during the 2010 data taking
  • Very good response from site admins and experts in case of problems
• The new AliEn v2.18 has been deployed and will be responsible for the ALICE data taking infrastructure in the coming months
  • Transparent deployment of the new version, in parallel to the startup of data taking
• In terms of services, sites are encouraged to provide the latest CREAM 1.6 version as soon as possible