Transcript of "Tier-1 Status", Andrew Sansum, GRIDPP18, 21 March 2007.

Page 1

Tier-1 Status

Andrew Sansum, GRIDPP18

21 March 2007

Page 2

Staff Changes

• Steve Traylen left in September

• Three new Tier-1 staff:
  – Lex Holt (Fabric Team)
  – James Thorne (Fabric Team)
  – James Adams (Fabric Team)

• One EGEE-funded post to operate a PPS (and work on integration with NGS):
  – Marian Klein

Page 3

Team Organisation

• Grid Services (Grid/Support): Ross, Condurache, Hodges, Klein (PPS), vacancy

• Fabric (H/W and OS): Bly (team leader), Wheeler, Holt, Thorne, White (OS support), Adams (HW support)

• CASTOR SW/Robot: Corney (GL), Strong (Service Manager), Folkes (HW Manager), deWitt, Jensen, Kruk, Ketley, Bonnet (2.5 FTE effort)

• Machine Room operations

• Networking Support

• Database Support (Brown)

• Project Management (Sansum/Gordon/(Kelsey))

Page 4

Hardware Deployment - CPU

• 64 dual-CPU, dual-core Intel Woodcrest 5130 systems delivered in November (about 550 KSI2K)

• Completed acceptance tests over Christmas and entered production in mid-January

• CPU farm capacity is now approximately:
  – 600 systems
  – 1250 cores
  – 1500 KSI2K
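As a rough cross-check, a minimal sketch (using only the approximate figures quoted above) of the per-core ratings they imply:

```python
# Rough per-core KSI2K ratings implied by the approximate figures above.

woodcrest_systems = 64
woodcrest_cores = woodcrest_systems * 2 * 2        # dual-CPU x dual-core
woodcrest_ksi2k = 550
print(f"New Woodcrest nodes: ~{woodcrest_ksi2k / woodcrest_cores:.1f} KSI2K per core")

farm_cores, farm_ksi2k = 1250, 1500
print(f"Whole farm average:  ~{farm_ksi2k / farm_cores:.1f} KSI2K per core")
```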

Page 5

Hardware Deployment - Disk

• 2006 was a difficult year, with deployment hold-ups:
  – March 2006 delivery: 21 servers, Areca RAID controller, 24*400GB WD (RE2) drives. Available: January 2007
  – November 2006 delivery: 47 servers, 3Ware RAID controller, 16*500GB WD (RE2). Accepted February 2007 (but still deploying to CASTOR)
  – January 2007 delivery: 39 servers, 3Ware RAID controller, 16*500GB WD (RE2). Accepted March 2007; ready to deploy to CASTOR

Page 6

Disk Deployment - Issues

• March 2006 (Clustervision) delivery:
  – Originally delivered with 400GB WD400YR drives
  – Many drive ejects under normal load test (the drives had worked OK when we tested in January)
  – The drive specification was found to have changed, causing compatibility problems with the RAID controller (despite the drive being listed as compatible)
  – Various firmware fixes tried; improvements, but not fixed
  – August 2006: WD offered to replace with the 500YS drive
  – September 2006: load tests of the new configuration began to show occasional (but unacceptably frequent) drive ejects (a different problem)
  – Major diagnostic effort by Western Digital; Clustervision also tried various fixes. Lots of theories (vibration, EM noise, protocol incompatibility) and various fixes tried (slow going, as the failure rate was quite low)
  – The fault was hard to trace; various theories and fixes were tried, but it was eventually traced (early December) to faulty firmware
  – Firmware updated and load testing showed the problem fixed (mid-December). The load test completed in early January and deployment began

Page 7

Disk Deployment - Cause

• Western Digital worked at two sites, with logic analysers on the SATA interconnect.

• The fault was eventually traced to a "missing return" in the drive firmware:
  – If the drive head stays too long in one place, it repositions to allow lubricant to migrate
  – This only shows up under certain work patterns
  – No return is sent following the reposition, and 8 seconds later the controller ejects the drive
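The failure mode can be pictured with a toy sketch (purely illustrative: the real fault was inside the drive's SATA firmware, and only the reposition behaviour and the 8-second controller timeout come from the description above):

```python
# Toy illustration of the fault described above: after an idle-head reposition
# the drive never sends its completion response (the "missing return"), so the
# RAID controller hears nothing, its timeout expires and it ejects the drive.

CONTROLLER_TIMEOUT_S = 8      # controller ejects a drive that stays silent this long


def drive_handle_command(head_idle_too_long: bool) -> bool:
    """Return True if the drive sends its completion response."""
    if head_idle_too_long:
        reposition_head()     # move the head so the lubricant can migrate
        # BUG: missing "return send_completion()" here, so the function falls
        # through and the controller never hears back from the drive.
    else:
        return send_completion()
    return False              # silence


def reposition_head() -> None:
    pass                      # stand-in for the head reposition


def send_completion() -> bool:
    return True


def controller_view(head_idle_too_long: bool) -> str:
    if drive_handle_command(head_idle_too_long):
        return "command completed"
    return f"no response; drive ejected after {CONTROLLER_TIMEOUT_S}s"


print(controller_view(head_idle_too_long=False))  # normal workload: completes
print(controller_view(head_idle_too_long=True))   # rare access pattern: eject
```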

Page 8

Disk Deployment

Delivery      #Servers   Capacity (TB)
2006              57         179
Jan 2007          21         190
Feb 2007          47         238
March 2007        39         197
Total            138         750

Page 9

Hardware Deployment - Tape

• SL8500 tape robot upgraded to 10000 slots in August 2006.

• GRIDPP bought 3 additional T10K tape drives in February 2007 (GRIDPP now owns 6 drives)

• Further purchase of 350TB of tape media in February 2007.

• Total tape capacity is now 850-900TB (not all of it immediately allocated: some is needed to assist the CASTOR migration and some for CASTOR operations).

Page 10

Hardware Deployment - Network

• 10Gb line from CERN available in August 2006

• RAL was scheduled to attach to the Thames Valley Network (TVN) at 10Gb by November 2006:
  – Change of plan in November: I/O rates from the Tier-1 were already visible to UKERNA, so it was decided to connect the Tier-1 by a 10Gb resilient connection direct into the SJ5 core (planned for mid Q1)
  – The connection was delayed, but is now scheduled for the end of March

• GRIDPP load tests identified several issues at the RAL firewall. These have been resolved, but the plan is now to bypass the firewall for SRM traffic from SJ5.

• A number of internal Tier-1 topology changes while we have enhanced the LAN backbone to 10Gb in preparation for SJ5.

Page 11

[Diagram: Tier-1 LAN. The RAL site router (Router A) and the OPN router connect to a Tier-1 backbone of 5510/5530 switch stacks (a 4 x 5530 core plus stacks of 2-6 x 5510 + 5530) serving CPUs + disks, the ADS caches, the Oracle systems and the RAL Tier 2; 10Gb/s backbone links, a 10Gb/s link to CERN, N x 1Gb/s links into the stacks and a 1Gb/s link to SJ4.]

Page 12

New Machine Room

• Tender underway; planned completion August 2008
• 800 m² can accommodate 300 racks + 5 robots
• 2.3MW power/cooling capacity (some UPS)
• Office accommodation for all E-Science staff
• Combined Heat and Power generation (CHP) on site
• Not all for GRIDPP (but you get most)!
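For a rough sense of scale, a minimal sketch of the average power per rack implied by the quoted figures (it ignores the robots and any UPS or cooling overhead, so it is indicative only):

```python
# Average power per rack implied by the quoted figures (indicative only:
# ignores the robots and any UPS/cooling overhead).

power_mw = 2.3
racks = 300
print(f"~{power_mw * 1000 / racks:.1f} kW per rack on average")   # about 7.7 kW
```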

Page 13

Tier-1 Capacity delivered to WLCG (2006)

[Pie chart of each site's share: Asia Pacific 4%, BNL 18%, CERN 11%, FNAL 5%, FZK 6%, IN2P3 8%, INFN-T1 15%, PIC 5%, RAL 17%, SARA/NIKHEF 10%, Others 1%.]

Page 14

Last 12 months CPU Occupancy

[Chart annotations: +260 KSI2K added in May 2006; +550 KSI2K added in January 2007.]

Page 15

Recent CPU Occupancy (4 weeks)

[Chart annotation: air-conditioning work (300 KSI2K offline).]

Page 16

CPU Efficiencies

Page 17

CPU Efficiencies

[Chart annotations: CMS merge jobs hang on CASTOR; ATLAS/LHCB jobs hanging on dCache; Babar jobs running slow, reason unknown.]

Page 18

3D Service

• Used by ATLAS and LHCB to distribute conditions data via Oracle Streams

• RAL was one of five sites that deployed a production service during Phase I.

• Small SAN cluster: 4 nodes and 1 Fibre Channel RAID array.

• RAL takes a leading role in the project.

Page 19

Reliability

• Reliability matters to the experiments:
  – We use the SAM monitoring to identify priority areas
  – We also worry about job loss rates

• The priority at RAL is to improve reliability:
  – Fix the faults that degrade our SAM availability
  – New exception monitoring and automation system based on Nagios (see the sketch below)

• Reliability is improving, but the work feels like an endless treadmill: fix one fault and find a new one.
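As an illustration of the Nagios-based approach, here is a minimal check-plugin sketch; the host name, port and thresholds are illustrative assumptions, not the Tier-1's actual checks. Nagios interprets plugin exit codes 0/1/2/3 as OK/WARNING/CRITICAL/UNKNOWN.

```python
#!/usr/bin/env python
# Minimal Nagios-style check plugin sketch: verify that a service port answers
# within a threshold. The host, port and thresholds below are illustrative.

import socket
import sys
import time

HOST = "ce.example.ac.uk"      # hypothetical service host
PORT = 2119                    # hypothetical port to probe
WARN_S, CRIT_S = 5.0, 15.0     # illustrative response-time thresholds


def main() -> int:
    start = time.time()
    try:
        socket.create_connection((HOST, PORT), timeout=CRIT_S).close()
    except OSError as exc:
        print(f"CRITICAL: {HOST}:{PORT} unreachable ({exc})")
        return 2
    elapsed = time.time() - start
    if elapsed > WARN_S:
        print(f"WARNING: {HOST}:{PORT} answered in {elapsed:.1f}s")
        return 1
    print(f"OK: {HOST}:{PORT} answered in {elapsed:.1f}s")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```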

Page 20

Reliability - CE

• Split the PBS server and the CE a long time ago

• Split the CE and the local BDII

• Site BDII timed out on the CE information provider:
  – CPU usage on the CE was very high and the information provider was "starved"
  – Upgraded the CE to 2 cores

• Site BDII still timed out on the CE information provider:
  – CE system disk was I/O bound
  – Reduced the load (changed backups etc.)
  – Finally replaced the system drive with a faster model

Page 21

CE Load

Page 22

Job Scheduling

• SAM jobs failing to be scheduled by MAUI:
  – SAM tests run under the operations VO but share a gid with dteam; dteam had used all of its resource, so MAUI started no more jobs
  – Changed the scheduling to favour the ops VO (the long-term plan is to split ops and dteam)

• PBS server hanging after communications problems:
  – A job stuck in the pending state jams the whole batch system (no jobs start: site unavailable!)
  – We now auto-detect the state of pending jobs and hold them, so the remaining jobs start and availability is good (a minimal sketch of this kind of automation follows below)
  – But held jobs now impact the ETT and we receive less work from the RB, so we have to delete held jobs
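A minimal sketch of that kind of automation, assuming a Torque/PBS-style command-line interface (qstat, qhold); the output parsing and the "stuck" test are placeholders, not the production logic:

```python
#!/usr/bin/env python
# Sketch: find jobs sitting in the queued/pending state and place a hold on the
# ones that look stuck, so the remaining jobs can keep starting. Assumes a
# Torque/PBS-style CLI; the "stuck" test below is a placeholder.

import subprocess


def queued_job_ids():
    """Yield ids of jobs that plain `qstat` reports in state 'Q'."""
    out = subprocess.run(["qstat"], capture_output=True, text=True, check=True).stdout
    for line in out.splitlines():
        fields = line.split()
        # Default qstat rows look like: <jobid> <name> <user> <time> <S> <queue>
        if len(fields) >= 6 and fields[4] == "Q":
            yield fields[0]


def looks_stuck(jobid: str) -> bool:
    """Placeholder: decide whether this queued job is jamming the scheduler."""
    return False   # replace with a real site-specific test


def hold_job(jobid: str) -> None:
    subprocess.run(["qhold", jobid], check=False)
    print(f"held stuck job {jobid}")


if __name__ == "__main__":
    for jobid in queued_job_ids():
        if looks_stuck(jobid):
            hold_job(jobid)
```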

Page 23

Jobs de-queued at CE

• Jobs reach the CE and are successfully submitted to the scheduler, but shortly afterwards the CE decides to de-queue the job:
  – Only impacts SAM monitoring occasionally
  – May be impacting users more than SAM, but we cannot tell from our logs
  – Logged a GGUS ticket, but no resolution yet

Page 24

RB

• The RB ran very busy for extended periods during the summer:
  – A second RB (rb02) was added in early November, but there is no transparent way of advertising it; UIs need to be configured manually (see the GRIDPP wiki)

• Jobs found to abort on rb01 were linked to the size of its database:
  – The database needed cleaning (it was over 8GB)

• Job cancels may (but not reproducibly) break the RB (it may go 100% CPU bound); no fix for this ticket yet.

Page 25

RB Load

[Chart annotations: rb02 deployed; drained to fix hardware; rb02 high CPU load.]

Page 26

Top Level BDII

• The top-level BDII was not reliably responding to queries:
  – Query rate too high
  – UK sites failing SAM tests for extended periods

• Upgraded the BDII to two servers behind a DNS round robin:
  – Sites still occasionally failed SAM tests

• Upgraded the BDII to 3 servers (last Friday):
  – We hope the problem is fixed; please report timeouts (a connectivity-check sketch follows below)
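A minimal sketch of one way to check every server behind the DNS round-robin alias (the alias name is hypothetical, and this only tests that the LDAP port accepts connections; a real check would run an actual query):

```python
#!/usr/bin/env python
# Sketch: resolve all addresses behind a DNS round-robin alias and check that
# each one accepts TCP connections on the BDII LDAP port within a timeout.
# The alias below is hypothetical; adjust host and port for the real service.

import socket

ALIAS = "lcg-bdii.example.ac.uk"   # hypothetical round-robin alias
PORT = 2170                        # conventional BDII LDAP port
TIMEOUT_S = 10.0


def members(alias: str):
    """Return the distinct IP addresses the alias currently resolves to."""
    infos = socket.getaddrinfo(alias, PORT, proto=socket.IPPROTO_TCP)
    return sorted({info[4][0] for info in infos})


def responds(addr: str) -> bool:
    try:
        socket.create_connection((addr, PORT), timeout=TIMEOUT_S).close()
        return True
    except OSError:
        return False


if __name__ == "__main__":
    for addr in members(ALIAS):
        print(f"{addr}:{PORT} {'OK' if responds(addr) else 'NOT RESPONDING'}")
```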

Page 27

FTS

• Reasonably reliable service:
  – Based on a single server
  – Monitoring and automation to watch for problems

• At the next upgrade (soon), move from a single server to two pairs:
  – One pair will handle the transfer agents
  – One pair will handle the web front end

Page 28

dCache

• Problems with gridftp doors hanging:
  – Partly helped by changes to network tuning
  – But still impacts SAM tests (and experiments). Decided to move the SAM CE replica-manager test from dCache to CASTOR (a cynical manoeuvre to help SAM)

• We had hoped this month's upgrade to version 1.7 would resolve the problem:
  – It didn't help
  – We have now upgraded all gridftp doors to Java 1.5; no hangs since the upgrade last Thursday

Page 29

SAM Availability

RAL-LCG2 Availability/Reliability

[Chart: RAL-LCG2 availability and reliability by month, May-06 to Feb-07, on a 0-100% scale, with series for Available, Old Reliability, New Reliability, Target, Average and Best 8.]

Page 30

CASTOR

• Autumn 2005/Winter 2005:
  – Decided to migrate the tape service to CASTOR
  – Decision that CASTOR will eventually replace dCache for disk pool management; CASTOR2 deployment starts

• Spring/Summer 2006: major effort to deploy and understand CASTOR:
  – Difficult to establish a stable pre-production service
  – Upgrades extremely difficult to make work; the test service was down for weeks at a time following an upgrade or patching

• September 2006:
  – Originally we had planned to have a full production service by now
  – Eventually, after heroic effort, the CASTOR team established a pre-production service for CSA06

• October 2006:
  – But we didn't have any disk and had to borrow it: BIG THANK YOU PPD!
  – CASTOR performed well in CSA06

• November/December 2006: worked on a CASTOR upgrade, but eventually failed to upgrade

• January 2007: declared the CASTOR service production quality

• Feb/March 2007:
  – Continuing work with CMS as they expand the range of tasks expected of CASTOR; significant load-related operational issues identified (e.g. CMS merge jobs cause LSF meltdown)
  – Starting work with ATLAS/LHCB and MINOS to migrate to CASTOR

Page 31

CASTOR Layout

[Diagram: CASTOR layout. SRM 1 endpoints ralsrma, ralsrmb, ralsrmc, ralsrmd, ralsrme and ralsrmf sit in front of the disk pools / service classes, which include D1T0, cmswanout, D0T1 prd, D0T1 tmp, CMSwanin, cmsFarmRead, lhcbD1T0, atlasD1T0prod, atlasD1T0usr, atlasD1T1, atlasD0T1test and atlasD1T0test.]

Page 32

CMS

Page 33

Phedex Rate to CASTOR (RAL Destination)

Page 34

Phedex Rate to CASTOR (RAL Source)

Page 35

SL4 and gLite

• Preparing to migrate some batch workers to SL4 for experiment testing.

• Some gLite testing (on SL3) is already underway, but we are becoming increasingly nervous about the risks associated with late deployment of the forthcoming SL4 gLite release.

Page 36

Grid Only

• Long-standing milestone that the Tier-1 will offer a “Grid Only” service by the end of August 2007.

• Discussed at the January UB, with considerable discussion of what “Grid Only” actually means.

• The basic target was confirmed by the Tier-1 Board, but the details of exactly what remains needed are still to be fixed.

Page 37

Conclusions

• Last year was a tough year, but we eventually made good progress:
  – A lot of problems encountered
  – A lot accomplished

• This year the focus will be on:
  – Establishing a stable CASTOR service that meets the needs of the experiments
  – Deploying the required releases of SL4/gLite
  – Meeting (and hopefully exceeding) availability targets
  – Ramping up hardware as we move towards GRIDPP3