Transcript of TRIUMF Site Report for HEPiX, CASPUR, April 3-7, 2006 – Corrie Kost

Page 1

TRIUMF SITE REPORT

Corrie Kost

Update since HEPiX Fall 2005

Page 2

Devolving Server Functions

OLD:
• Windows Print Server
• Windows Domain Controller

NEW:
• Windows print server cluster – 2 Dell PowerEdge SC1425 machines sharing an external SCSI disk holding printer data
• 2 Dell PowerEdge SC1425 machines as primary & secondary Windows domain controllers

Page 3

Page 4

Waiting for 10Gb/sec DWDM XFP

Page 5

40 km DWDM

64 wavelengths/fibre

CH34 = 193.4 THz (1550.116 nm) – cross-checked below

~ US$10k each
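
As a quick cross-check of the DWDM channel figure above, the vacuum wavelength follows from λ = c/ν. A minimal sketch (the channel number and frequency come from the slide; the rest is standard physics):

    # Cross-check: ITU DWDM channel 34 at 193.4 THz should sit near 1550.116 nm (lambda = c / nu).
    C = 299_792_458.0          # speed of light in vacuum, m/s
    nu = 193.4e12              # CH34 optical frequency from the slide, Hz

    wavelength_nm = C / nu * 1e9
    print(f"CH34: {wavelength_nm:.3f} nm")   # -> 1550.116 nm, matching the slide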

Page 6

Page 7

Page 8

Servers / Data Centre

[Data centre diagram – nodes as labelled: GPS TIME, TRPRINT, CMMS, TRWEB, TGATE, DOCUMENTS, CONDORG, TRSHARE, TRMAIL, TRSERV, LCG worker nodes, IBM cluster, 1GB / 1GB links, TEMP TRSHARE, KOPIODOC, LCG storage, IBM ~2TB storage, TNT2K3, RH-FC-SL mirror, TRWINDATA, TRSWINAPPS, WINPRINT1/2, Promise storage]

Page 9

[Diagram – servers as labelled: TNT2K3, TRWINDATA, TRSWINAPPS, Promise storage, TRPRINT, TRWEB, TGATE, CONDORG, TRSHARE, TRMAIL, TRSERV, DOCUMENTS, CMMS, GPS TIME, WINPRINT1, WINPRINT2]

Page 10

TRIUMF-CERN ATLAS Lightpath – International Grid Testbed (CA*Net IGT) Equipment

Amanda Backup

ATLAS

Worker nodes (evaluation units): blades, Dual/Dual 64-bit 3 GHz Xeons, 4 GB RAM, 80 GB SATA
VOBOX: 2 GB, 3 GHz 64-bit Xeon, 2 × 160 GB SATA
LFC: 2 GB, 3 GHz 64-bit Xeon, 2 × 160 GB SATA
FTS: 2 GB, 3 GHz 64-bit Xeon, 3 × 73 GB SCSI
SRM head node: 2 GB, 64-bit Opteron, 2 × 232 GB RAID-1
sc1-sc3 dCache Storage Elements: 2 GB, 3 GHz 64-bit Xeon, 8 × 232 GB RAID-5
2 SDLT 160 GB drives / 26 cartridges; 2 SDLT 300 GB drives / 26 cartridges (rough capacity estimate below)

TIER1 prototype (Service Challenge)
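
A rough back-of-envelope on the tape capacity listed above, assuming each drive pair serves a 26-cartridge library at native (uncompressed) cartridge size – the pairing is my reading of the slide, not stated explicitly:

    # Rough native tape capacity for the two SDLT drive sets listed above (sketch).
    # Assumption: 26 cartridges per library, counted at native (uncompressed) capacity.
    sdlt1_native_gb = 160    # SDLT-I cartridges
    sdlt2_native_gb = 300    # SDLT-II cartridges
    carts = 26

    print(f"SDLT-I  library: {carts * sdlt1_native_gb / 1000:.1f} TB native")   # ~4.2 TB
    print(f"SDLT-II library: {carts * sdlt2_native_gb / 1000:.1f} TB native")   # ~7.8 TB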

Page 11

ATLAS/CERN TRIUMF

Page 12

Tier-0 → Tier-1 Tests, April 3-30

Site     Disk-Disk (MB/s)   Disk-Tape (MB/s)
ASGC          100                 75
TRIUMF         50                 50
BNL           200                 75
FNAL          200                 75
NDGF           50                 50
PIC            60*                60
RAL           150                 75
SARA          150                 75
IN2P3         200                 75
FZK           200                 75
CNAF          200                 75

• Any MB/sec rates below 90% of nominal need explanation and compensation in the days following (a minimal check is sketched below).

• Maintain rates unattended over Easter weekend (April 14-16)

• Tape tests April 18-24

• Experiment-driven transfers April 25-30

* The nominal rate for PIC is 100MB/s, but will be limited by the WAN until ~November 2006.

https://twiki.cern.ch/twiki/bin/view/LCG/LCGServiceChallenges
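
Since the 90%-of-nominal rule above drives the daily follow-up, here is a minimal sketch of that check; the function and threshold handling are illustrative only, not part of the actual SC tooling:

    # Illustrative check of a day's average transfer rate against the nominal targets above.
    NOMINAL_DISK_MBPS = {"ASGC": 100, "TRIUMF": 50, "BNL": 200, "FNAL": 200, "NDGF": 50,
                         "PIC": 60, "RAL": 150, "SARA": 150, "IN2P3": 200, "FZK": 200, "CNAF": 200}

    def needs_explanation(site: str, achieved_mbps: float, threshold: float = 0.9) -> bool:
        """True if the day's average rate fell below 90% of the site's nominal disk-disk rate."""
        return achieved_mbps < threshold * NOMINAL_DISK_MBPS[site]

    # Example: TRIUMF averaging 42 MB/s against a 50 MB/s target -> below 45 MB/s, so flag it.
    print(needs_explanation("TRIUMF", 42))   # True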

Page 13

ATLAS’ SC4 Plans – Extracted from Mumbai Workshop 17 Feb/2006(1)

• March-April (pre-SC4): 3-4 weeks for internal Tier-0 tests (Phase 0)
• April-May (pre-SC4): tests of distributed operations on a “small” testbed (the pre-production system)
• Last 3 weeks of June: Tier-0 test (Phase 1) with data distribution to Tier-1s (720 MB/s + full ESD to BNL)
• 3 weeks in July: distributed processing tests (Part 1)
• 2 weeks in July-August: distributed analysis tests (Part 1)
• 3-4 weeks in September-October: Tier-0 test (Phase 2) with data to Tier-2s
• 3 weeks in October: distributed processing tests (Part 2)
• 3-4 weeks in November: distributed analysis tests (Part 2)

(1) https://twiki.cern.ch/twiki/bin/view/LCG/SCWeeklyPhoneCon060220


Page 14

Repeated reads on the same set of (typically 16) files at ~600 MB/sec – over ~150 days that is ~7 PB (total since started: ~13 PB to March 30; no reboot for 134 days).
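
The ~7 PB figure above is consistent with simple arithmetic, assuming the ~600 MB/sec rate is sustained (decimal petabytes):

    # Sanity check: ~600 MB/s sustained for ~150 days is on the order of 7-8 PB.
    rate_mb_s = 600
    days = 150

    total_pb = rate_mb_s * 1e6 * 86_400 * days / 1e15   # bytes -> decimal PB
    print(f"{total_pb:.1f} PB")   # ~7.8 PB, consistent with the ~7 PB quoted above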

Page 15

[Chart: daily average read rate (MB/sec), Feb/Mar 2006, days 1-33, up to ~600-700 MB/sec – repeated reads on the same set of (typically 16) files; ~7 PB over ~150 days, ~13 PB total to March 30, no reboot for 134 days]

Page 16

Keeping it Cool

• Central Computing Room isolation fixed
• Combined two 11-ton air conditioners to even out the load
• Adding a heating coil to improve stability
• Blades for ATLAS! – 30% less heat, 20% less TCO

100 W/sq-ft → 200 W/sq-ft → 400 W/sq-ft means cooling is a significant cost factor

Note: electrical/cooling costs estimated at CAD $150k/yr (back-of-envelope check below)

• Water-cooled systems for (multicore/multi-CPU) blade systems?
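
As a rough sanity check on the CAD $150k/yr electrical/cooling estimate in the note above: assuming a hypothetical electricity rate of about CAD $0.06/kWh (the rate is not given in the slides), the implied continuous load is a few hundred kW.

    # Back-of-envelope: what continuous load does CAD $150k/yr correspond to?
    # Assumption (not from the slides): electricity at roughly CAD $0.06 per kWh.
    annual_cost_cad = 150_000
    rate_cad_per_kwh = 0.06
    hours_per_year = 8_760

    avg_load_kw = annual_cost_cad / (rate_cad_per_kwh * hours_per_year)
    print(f"~{avg_load_kw:.0f} kW average electrical + cooling load")   # ~285 kW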

Page 17

Keeping it Cool (2)

HP offers Modular Cooling System (MCS)

• Used when rack > 10-15 kW
• US$30K
• Chilled (5-10 °C) water
• Max load 30 kW/rack (17 GPM / 65 LPM @ 5 °C water @ 20 °C air) – heat-balance sketch below
• Water cannot reach servers
• Door open? – Cold air out front, hot out back
• Significantly less noise with doors closed
• H×W×D 1999 × 909 × 1295 mm (79" × 36" × 51"), 513 kg / 1130 lbs (empty)
• Not certified for Seismic or Zone 4
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00613691/c00613691.pdf
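
The 30 kW / 65 LPM figures above are consistent with a simple heat balance, Q = ṁ·c_p·ΔT; a minimal sketch (the ~6-7 °C water temperature rise is derived here, not quoted by HP):

    # Heat balance for the MCS figures above: how much must 65 L/min of water warm to carry 30 kW?
    flow_lpm = 65                 # chilled-water flow from the slide
    heat_kw = 30                  # maximum rack load from the slide
    cp_water = 4186               # specific heat of water, J/(kg*K)
    density = 1.0                 # kg/L (close enough for 5-10 C water)

    mass_flow_kg_s = flow_lpm * density / 60
    delta_t = heat_kw * 1000 / (mass_flow_kg_s * cp_water)
    print(f"Water temperature rise: ~{delta_t:.1f} C")   # ~6.6 C for 30 kW at 65 L/min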

Page 18

Amanda Backup at TRIUMF

Details by Steve McDonald, Thursday ~4:30 pm

Page 19

End of Presentation

Extra Slides on SC4 plans for reference…

Page 20

fts: FTS Server
  FTS = File Transfer Service
  Homepage: http://egee-jra1-dm.web.cern.ch/egee%2Djra1%2Ddm/FTS/
  Oracle database used
  64-bit Intel Xeon 3 GHz, 2 GB RAM, 3 × 73 GB SCSI disks
  IBM 4560-SLX Tape Library attached (will have 2 SDLT-II drives attached when they arrive, probably next week); SDLT-II does 300 GB native, 600 GB compressed
  Running SL 3.0.5, 64-bit

lfc: LFC Server
  LFC = LCG File Catalog
  Info page: https://uimon.cern.ch/twiki/bin/view/LCG/LfcAdminGuide
  MySQL database used
  64-bit Intel Xeon 3 GHz, 2 GB RAM, 2 × 160 GB SATA disks, software RAID-1
  Running SL 3.0.5, 64-bit

vobox: VO Box (Virtual Organization Box)
  Info page: http://agenda.nikhef.nl/askArchive.php?base=agenda&categ=a0613&id=a0613s3t1/transparencies
  64-bit Intel Xeon 3 GHz, 2 GB RAM, 2 × 160 GB SATA disks, software RAID-1
  Running SL 3.0.5, 64-bit

sc1-sc3: dCache Storage Elements
  64-bit Intel Xeons 3 GHz, 2 GB RAM
  3ware RAID controller, 8 × 232 GB disks in H/W RAID-5 giving 1.8 TB storage
  Running SL 4.1, 64-bit

sc4: SRM endpoint, dCache Admin node and Storage Element
  64-bit Opteron 246, 2 GB RAM
  3ware RAID controller, 2 × 232 GB disks in H/W RAID-1 giving 250 GB storage (capacity check below)
  Running SL 4.1, 64-bit
  IBM 4560-SLX Tape Library attached; we are moving both SDLT-I drives to this unit. SDLT-I does 160 GB native, 300 GB compressed.

Service Challenge Servers Details
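
The usable-capacity figures quoted for sc1-sc4 above line up with simple RAID arithmetic, assuming the "232 GB" per-disk figure is the binary (GiB) size of a nominal 250 GB drive (my assumption, not stated on the slide):

    # Rough usable-capacity check for the 3ware arrays above (sketch, not vendor math).
    # Assumption: the quoted "232 GB" per disk is a binary (GiB) size of a nominal 250 GB drive.
    GIB = 2**30
    disk_bytes = 232 * GIB                      # ~249 GB per disk

    raid5_usable = 7 * disk_bytes               # 8 disks, one disk's worth of parity
    raid1_usable = 1 * disk_bytes               # 2 disks mirrored

    print(f"RAID-5 (8x232GB): {raid5_usable/1e12:.2f} TB usable")   # ~1.74 TB, matches "~1.8 TB"
    print(f"RAID-1 (2x232GB): {raid1_usable/1e9:.0f} GB usable")    # ~249 GB, matches "250 GB"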

Page 21

Dario Barberis: ATLAS SC4 Plans

WLCG SC4 Workshop - Mumbai, 12 February 2006

https://twiki.cern.ch/twiki/pub/LCG/TalksAndDocuments/sc4-expt-plans.ppt

ATLAS SC4 Tests

• Complete Tier-0 test: internal data transfer from “Event Filter” farm to Castor disk pool, Castor tape, CPU farm
• Calibration loop and handling of conditions data, including distribution of conditions data to Tier-1s (and Tier-2s)
• Transfer of RAW, ESD, AOD and TAG data to Tier-1s
• Transfer of AOD and TAG data to Tier-2s
• Data and dataset registration in DB (add meta-data information to meta-data DB)
• Distributed production: full simulation chain run at Tier-2s (and Tier-1s); data distribution to Tier-1s, other Tier-2s and CAF
• Reprocessing raw data at Tier-1s; data distribution to other Tier-1s, Tier-2s and CAF
• Distributed analysis: “random” job submission accessing data at Tier-1s (some) and Tier-2s (mostly); tests of performance of job submission, distribution and output retrieval

Page 22

Dario Barberis: ATLAS SC4 Plans

WLCG SC4 Workshop - Mumbai, 12 February 2006

https://twiki.cern.ch/twiki/pub/LCG/TalksAndDocuments/sc4-expt-plans.ppt

ATLAS SC4 Plans (1): Tier-0 data flow tests

Phase 0: 3-4 weeks in March-April for internal Tier-0 tests
• Explore limitations of current setup
• Run real algorithmic code
• Establish infrastructure for calib/align loop and conditions DB access
• Study models for event streaming and file merging
• Get input from SFO simulator placed at Point 1 (ATLAS pit)
• Implement system monitoring infrastructure

Phase 1: last 3 weeks of June, with data distribution to Tier-1s
• Run integrated data flow tests using the SC4 infrastructure for data distribution
• Send AODs to (at least) a few Tier-2s
• Automatic operation for O(1 week)
• First version of shifter’s interface tools
• Treatment of error conditions

Phase 2: 3-4 weeks in September-October
• Extend data distribution to all (most) Tier-2s
• Use 3D tools to distribute calibration data

The ATLAS TDAQ Large Scale Test in October-November prevents further Tier-0 tests in 2006… but it is not incompatible with other distributed operations.

No external data transfer during this phase (?)

Page 23

Dario Barberis: ATLAS SC4 Plans

WLCG SC4 Workshop - Mumbai, 12 February 2006

https://twiki.cern.ch/twiki/pub/LCG/TalksAndDocuments/sc4-expt-plans.ppt

ATLAS SC4 Plans (2)

ATLAS CSC includes continuous distributed simulation productions:
• We will continue running distributed simulation productions all the time, using all Grid computing resources we have available for ATLAS
• The aim is to produce ~2M fully simulated (and reconstructed) events/week from April onwards, both for physics users and to build the datasets for later tests
• We can currently manage ~1M events/week; ramping up gradually

SC4: distributed reprocessing tests
• Test of the computing model using the SC4 data management infrastructure
• Needs file transfer capabilities between Tier-1s and back to the CERN CAF
• Also distribution of conditions data to Tier-1s (3D)
• Storage management is also an issue
• Could use 3 weeks in July and 3 weeks in October

SC4: distributed simulation intensive tests
• Once reprocessing tests are OK, we can use the same infrastructure to implement our computing model for simulation productions, as they would use the same setup both from our ProdSys and the SC4 side
• First separately, then concurrently

Page 24

Dario Barberis: ATLAS SC4 Plans

WLCG SC4 Workshop - Mumbai, 12 February 2006

https://twiki.cern.ch/twiki/pub/LCG/TalksAndDocuments/sc4-expt-plans.ppt

ATLAS SC4 Plans (3)

Distributed analysis tests:

• “Random” job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
• Generate groups of jobs and simulate analysis job submission by users at home sites
• Direct jobs needing only AODs as input to Tier-2s
• Direct jobs needing ESDs or RAW as input to Tier-1s
• Make preferential use of ESD and RAW samples available on disk at Tier-2s
• Tests of performance of job submission, distribution and output retrieval
• Test job priority and site policy schemes for many user groups and roles
• Distributed data and dataset discovery and access through metadata, tags, data catalogues
• Need the same SC4 infrastructure as needed by distributed productions
• Storage of job outputs for private or group-level analysis may be an issue
• Tests can be run during Q3-Q4 2006: first a couple of weeks in July-August (after distributed production tests), then another longer period of 3-4 weeks in November