London Tier 2 Status Report, GridPP 13, Durham, 4th July 2005. Owen Maroney, David Colling.

Page 1

London Tier 2 Status Report
GridPP 13, Durham, 4th July 2005
Owen Maroney, David Colling

Page 2

Brunel

• 2 WN PBS farm @ LCG-2_4_0
  – R-GMA and APEL installed
  – RH7.3, installed with LCFG
• Additional farm being installed
  – SL3
  – Privately networked WNs
  – 16 nodes
  – Expected to move into production after the 2_6_0 upgrade
  – Hoping to bring further resources online over the summer
  – Recruiting a shared support post with RHUL (job offer made)

Page 3

Imperial College London

• Appointment of Mona Aggarwal to the GridPP Hardware Support post
• 52 CPU Torque HEP farm @ LCG-2_5_0
  – R-GMA and APEL installed
  – OS: RHEL 3
• IC HEP participating in SC3 as the UK CMS site
  – dCache SRM installed with 2.6 TB storage + 6 TB on order
  – Another 6 TB on order
• Numerous power outages (scheduled and unscheduled) have caused availability problems
• London e-Science Centre
  – SAMGrid installed across HEP and LeSC
    • Certified for D0 data reprocessing
    • 186 job slots
  – SGE farm, 64-bit RHEL
    • Globus jobmanager installed
    • Beta version of the SGE plug-in to the generic information provider (see the illustrative sketch after this list)
    • Firewall issues had blocked progress but these have now been resolved; testing will start soon
  – “Community of Interest” mailing list established for sites interested in SGE integration with LCG
    • [email protected]
    • 19 subscribers from sites in the UK, Italy, Spain, Germany, France and Russia
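The SGE plug-in mentioned above feeds batch-system state into the LCG generic information provider, which publishes GLUE-schema attributes for the CE. The snippet below is a hedged illustration only, not the LeSC beta plug-in: it assumes SGE's "qstat -g c" cluster-queue summary with the usual USED/AVAIL/TOTAL columns, and prints a few GLUE 1.x-style CE state attributes in LDIF form.

  # Illustrative sketch only -- not the actual beta plug-in.
  # Assumes "qstat -g c" output has the columns:
  #   CLUSTER QUEUE  CQLOAD  USED  RES  AVAIL  TOTAL  aoACDS  cdsuE
  import subprocess

  out = subprocess.run(["qstat", "-g", "c"],
                       capture_output=True, text=True, check=True).stdout

  used = avail = total = 0
  for line in out.splitlines()[2:]:        # skip the two header lines
      fields = line.split()
      if len(fields) < 6:
          continue
      used  += int(fields[2])              # USED column
      avail += int(fields[4])              # AVAIL column
      total += int(fields[5])              # TOTAL column

  # Publish a few GLUE 1.x CE state attributes in LDIF style, as a
  # generic-information-provider plug-in would.
  print(f"GlueCEInfoTotalCPUs: {total}")
  print(f"GlueCEStateFreeCPUs: {avail}")
  print(f"GlueCEStateRunningJobs: {used}")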

Page 4

Queen Mary

• 320 CPU Torque farm
  – After difficulties with Fedora 2, have moved the LCG WNs to SL3
  – Departure of a key staff member just as LCG-2_4_0 was released led to manpower problems
    • GridPP Hardware Support post now filled
    • Giuseppe Mazza started on 1st July
  – R-GMA and APEL installed early in June

Page 5

Royal Holloway

• Little change: 148 CPU Torque farm
  – LCG-2_4_0
  – OS: SL3
  – R-GMA installed
    • Problems with the default APEL installation
    • Gatekeeper and batch server are on separate nodes
• Little manpower available
  – GridPP Hardware Support post shared with Brunel is still in the recruitment process (job offer made)

Page 6

University College London

• UCL-HEP: 20 CPU PBS farm @ LCG-2_4_0
  – OS: SL3
  – R-GMA installed
    • Problems with the default APEL installation
    • Batch server separate from the gatekeeper
• UCL-CCC: 88 CPU Torque farm @ LCG-2_4_0
  – OS: SL3
  – R-GMA and APEL installed
  – Main cluster is an SGE farm
    • Interest in putting the SGE farm into LCG and integrating the nodes into a single farm

Page 7

Current site status summary

Site        Service nodes     Worker nodes      Local network   Site connectivity   SRM      Days SFT failed   Days in scheduled maintenance
Brunel      RH7.3 LCG 2.4.0   RH7.3 LCG 2.4.0   1 Gb            100 Mb              No       21                16
Imperial    RHEL3 LCG 2.5.0   RHEL3 LCG 2.5.0   1 Gb            1 Gb                dCache   26                28
QMUL        SL3 LCG 2.4.0     SL3 LCG 2.4.0     1 Gb            100 Mb              No       45                12
RHUL        RHEL3 LCG 2.4.0   RHEL3 LCG 2.4.0   1 Gb            1 Gb                No       22                29
UCL (HEP)   SL3 LCG 2.4.0     SL3 LCG 2.4.0     1 Gb            1 Gb                No       9                 30
UCL (CCC)   SL3 LCG 2.4.0     SL3 LCG 2.4.0     1 Gb            1 Gb                No       12                9

1) Local network connectivity is that to the site SE.
2) It is understood that SFT failures do not always result from site problems, but it is the best measure currently available.

Page 8

LCG resources

         Estimated for LCG                        Currently delivering to LCG
Site     Job slots   CPU (kSI2K)   Storage (TB)   Job slots   CPU (kSI2K)   Storage (TB)
Brunel   60          60            1              4           4             0.4
IC       66          33            16             52          26            3
QMUL     572         247           13.5           464         200           0.1
RHUL     142         167           3.2            148         167           7.7
UCL      204         108           0.8            186         98            0.8
Total    1044        615           34.5           854         495           12

1) The estimated figures are those projected for LCG planning purposes: http://lcg-computing-fabric.web.cern.ch/LCG-Computing-Fabric/GDB_resource_infos/Summary_Institutes_2004_2005_v11.htm
2) Current total job slots are those reported by the EGEE/LCG gstat page.
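For a rough sense of scale, the snippet below compares the "Total" row of the table: a minimal sketch whose figures are copied directly from the table above and whose percentages follow from simple division.

  # Delivered versus estimated LT2 totals, copied from the table above.
  estimated = {"job slots": 1044, "CPU (kSI2K)": 615, "storage (TB)": 34.5}
  delivered = {"job slots": 854,  "CPU (kSI2K)": 495, "storage (TB)": 12}

  for key in estimated:
      pct = 100.0 * delivered[key] / estimated[key]
      print(f"{key}: {delivered[key]} of {estimated[key]} delivered ({pct:.0f}% of estimate)")

  # Roughly 82% of the estimated job slots, 80% of the CPU and 35% of the
  # storage are currently being delivered to LCG.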

Page 9

Resources used per VO over the quarter (kSI2K hours)

Site       ALICE    ATLAS    BABAR   CMS       LHCB   ZEUS   Total
Brunel     -        -        6       149       -      -      155
Imperial   19       848      221     4,863     -      312    6,263
QMUL       -        41       116     82,697    -      -      82,854
RHUL       1,124    1,840    79      42,218    -      -      45,261
UCL        -        6,982    126     14,115    -      -      21,223
Total      1,143    9,711    548     144,042   -      312    155,756

Data taken from APEL

Page 10

Usage expressed as a pie chart: percentage share of the number of jobs per VO. 51,209 jobs in total according to APEL.

Page 11

Site Experiences

• The LCG-2_4_0 release was the first with a “scheduled” release date
  – Despite a slippage of one week in the release (and an overlap with the EGEE conference), all LT2 sites upgraded within 3 weeks
    • Some configuration problems for a week afterwards
  – Overall the experience was better than in the past
• Farms are not fully utilised (a worked utilisation example follows this list)
  – This is true of the grid as a whole
  – Will extend the range of VOs supported
• Overall improvement in Scheduled Downtime (SD) compared to the previous quarter
  – QMUL had manpower problems
    • NB: although QMUL had the highest number of days of SFT failure + SD, it provided the most actual processing power during the quarter!
  – IC had several scheduled power outages, plus two unscheduled power failures
    • These caused knock-on failures for sites using the BDII hosted at IC
• IC installed a dCache SRM in preparation for SC3
  – Installation and configuration were not simple: the default configuration was not suitable for most Tier 2 sites, and changing from the default was hard
  – Some security concerns: installations are not secure by default
• Coordinator: Owen is leaving in two weeks; an offer has been made
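As a rough illustration of the under-utilisation point above, the sketch below estimates QMUL's CPU utilisation from the figures in the earlier tables (200 kSI2K currently delivered, 82,854 kSI2K-hours used over the quarter). The 91-day quarter is an assumption, since the exact accounting period is not stated, so treat the percentage as indicative only.

  # Back-of-the-envelope utilisation estimate for QMUL, using numbers from
  # the resource and per-VO tables above. The 91-day quarter is assumed.
  delivered_ksi2k = 200          # kSI2K currently delivered to LCG (resource table)
  used_ksi2k_hours = 82_854      # kSI2K-hours used over the quarter (APEL table)

  quarter_hours = 91 * 24        # ~2,184 hours in a quarter (assumed)
  capacity_ksi2k_hours = delivered_ksi2k * quarter_hours

  utilisation = used_ksi2k_hours / capacity_ksi2k_hours
  print(f"QMUL capacity over the quarter: {capacity_ksi2k_hours:,.0f} kSI2K-hours")
  print(f"Estimated utilisation: {utilisation:.0%}")   # roughly 19%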