TRIUMF Site Report for HEPiX, CASPUR, April 3-7, 2006 – Corrie Kost
TRIUMF SITE REPORT
Corrie Kost
Update since Hepix Fall 2005
Devolving Server Functions
OLD:
• Windows Print Server
• Windows Domain controller
NEW:
• Windows print server cluster: 2 Dell PowerEdge SC1425 machines sharing an external SCSI disk holding printer data
• 2 Dell PowerEdge SC1425 as primary & secondary Windows domain controllers
Waiting for 10Gb/sec DWDM XFP
40 km DWDM
64 wavelengths/fibre
CH34 = 193.4 THz (1550.116 nm)
~ $10kUS each
Servers / Data Centre
[Diagram: data-centre servers and storage - GPS TIME, TRPRINT, CMMS, TRWEB, TGATE, DOCUMENTS, CONDORG, TRSHARE, TRMAIL, TRSERV; LCG worker nodes; IBM cluster (1GB links); TEMP TRSHARE, KOPIODOC; LCG storage; IBM ~2 TB storage; TNT2K3; RH-FC-SL mirror; TRWINDATA, TRSWINAPPS, WINPRINT1/2; Promise storage]
[Diagram: rack layout - TNT2K3, TRWINDATA, TRSWINAPPS, Promise storage, TRPRINT, TRWEB, TGATE, CONDORG, TRSHARE, TRMAIL, TRSERV, DOCUMENTS, CMMS, GPS time, WINPRINT1/WINPRINT2]
TRIUMF-CERN ATLAS Lightpath - International Grid Testbed (CA*Net IGT)
Equipment
Amanda Backup
ATLAS
Worker nodes: (evaluation units)
Blades, Dual/Dual 64-bit 3GHz Xeons
4 GB RAM, 80 GB SATA
VOBOX: 2 GB RAM, 3 GHz 64-bit Xeon, 2 × 160 GB SATA
LFC: 2 GB RAM, 3 GHz 64-bit Xeon, 2 × 160 GB SATA
FTS: 2 GB RAM, 3 GHz 64-bit Xeon, 3 × 73 GB SCSI
SRM head node:
2 GB RAM, 64-bit Opteron, 2 × 232 GB RAID-1
sc1-sc3 dCache Storage Elements:
2 GB RAM, 3 GHz 64-bit Xeon, 8 × 232 GB RAID-5
2 SDLT 160 GB drives / 26 cartridges
2 SDLT 300 GB drives / 26 cartridges
TIER1 prototype (Service Challenge)
ATLAS/CERN TRIUMF
Site Disk-Disk (MB/s) Disk-Tape (MB/s)
ASGC 100 75
TRIUMF 50 50
BNL 200 75
FNAL 200 75
NDGF 50 50
PIC 60* 60
RAL 150 75
SARA 150 75
IN2P3 200 75
FZK 200 75
CNAF 200 75
Tier-0 to Tier-1 Tests, April 3-30
• Any MB/sec rate below 90% of nominal needs explanation and compensation in the days following.
• Maintain rates unattended over Easter weekend (April 14-16)
• Tape tests April 18-24
• Experiment-driven transfers April 25-30
* The nominal rate for PIC is 100 MB/s, but it will be limited by the WAN until ~November 2006.
https://twiki.cern.ch/twiki/bin/view/LCG/LCGServiceChallenges
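The 90%-of-nominal rule above is easy to check mechanically. A minimal sketch (not part of the original slides; the function name and example rates are illustrative) that flags sites whose achieved disk-disk rate misses the threshold, using the nominal rates from the table:

```python
# Nominal disk-disk rates (MB/s) from the Tier-1 table above.
nominal_disk_disk = {
    "ASGC": 100, "TRIUMF": 50, "BNL": 200, "FNAL": 200, "NDGF": 50,
    "PIC": 60, "RAL": 150, "SARA": 150, "IN2P3": 200, "FZK": 200, "CNAF": 200,
}

def sites_needing_explanation(achieved, nominal=nominal_disk_disk, threshold=0.9):
    """Return sites whose achieved MB/s falls below threshold * nominal."""
    return [site for site, rate in achieved.items()
            if rate < threshold * nominal[site]]

# Hypothetical measurements: TRIUMF at 50 MB/s meets its target;
# RAL at 120 MB/s is below 0.9 * 150 = 135 MB/s and gets flagged.
print(sites_needing_explanation({"TRIUMF": 50, "RAL": 120}))  # → ['RAL']
```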
ATLAS’ SC4 Plans – Extracted from Mumbai Workshop, 17 Feb 2006 (1)
• March-April (pre-SC4): 3-4 weeks for internal Tier-0 tests (Phase 0)
• April-May (pre-SC4): tests of distributed operations on a “small” testbed (the pre-production system)
• Last 3 weeks of June: Tier-0 test (Phase 1) with data distribution to Tier-1s (720 MB/s + full ESD to BNL)
• 3 weeks in July: distributed processing tests (Part 1)
• 2 weeks in July-August: distributed analysis tests (Part 1)
• 3-4 weeks in September-October: Tier-0 test (Phase 2) with data to Tier-2s
• 3 weeks in October: distributed processing tests (Part 2)
• 3-4 weeks in November: distributed analysis tests (Part 2)
(1) https://twiki.cern.ch/twiki/bin/view/LCG/SCWeeklyPhoneCon060220
Repeated reads on the same set of (typically 16) files at ~600 MB/s: ~7 PB during ~150 days (~13 PB total since startup, to March 30; no reboot for 134 days)
Daily Average (Feb/Mar 2006)
[Chart: daily average read rate (MB/s), 0-700 MB/s, over days 1-33]
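As a back-of-envelope check of the figure above (my arithmetic, not from the slides), a sustained ~600 MB/s over ~150 days is indeed in the quoted ~7 PB range:

```python
# Sustained read rate and duration quoted on the slide.
rate_mb_s = 600   # MB/s
days = 150        # days of repeated reads

# Total bytes read, converted to petabytes (1 PB = 1e15 bytes).
total_pb = rate_mb_s * 1e6 * days * 86400 / 1e15
print(f"{total_pb:.1f} PB")  # → 7.8 PB, consistent with the quoted ~7 PB
```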
Keeping it Cool
• Central Computing Room isolation fixed
• Combined two 11-ton air conditioners to even out the load
• Adding a heating coil to improve stability
• Blades for ATLAS: 30% less heat, 20% lower TCO
• Power densities of 100, 200, and 400 W/sq-ft mean cooling is a significant cost factor
Note: electrical/cooling costs estimated at Can$150k/yr
• Water-cooled systems for (multicore/multi-CPU) blade systems?
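To put the quoted ~Can$150k/yr in context, a rough sketch of the arithmetic (entirely my own illustration; the $0.06/kWh rate, the 50% cooling overhead, and the 190 kW load are assumed values, not figures from the slides):

```python
def annual_cost_cad(it_load_kw, price_per_kwh=0.06, cooling_overhead=0.5):
    """Annual cost of running it_load_kw continuously, with cooling assumed
    to add cooling_overhead on top of the IT load (illustrative values)."""
    total_kw = it_load_kw * (1 + cooling_overhead)
    return total_kw * 24 * 365 * price_per_kwh

# A combined load around 190 kW would be on the order of the quoted figure:
print(f"${annual_cost_cad(190):,.0f}/yr")  # → $149,796/yr
```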
Keeping it Cool (2)
HP offers Modular Cooling System (MCS)
• Used when rack load exceeds 10-15 kW
• US$30k
• Chilled (5-10 °C) water
• Max load 30 kW/rack (17 GPM / 65 LPM at 5 °C water, 20 °C air)
• Water cannot reach the servers
• Door open? Cold air out the front, hot out the back
• Significantly less noise with doors closed
• HWD 1999 × 909 × 1295 mm (79” × 36” × 51”), 513 kg / 1130 lb (empty)
• Not certified for Seismic or Zone 4
http://h20000.www2.hp.com/bc/docs/support/SupportManual/c00613691/c00613691.pdf
Amanda Backup at TRIUMF
Details by Steve McDonald Thursday ~ 4:30pm
End of Presentation
Extra Slides on SC4 plans for reference…
Service Challenge Server Details

fts: FTS Server (FTS = File Transfer Service)
homepage: http://egee-jra1-dm.web.cern.ch/egee%2Djra1%2Ddm/FTS/
Oracle database used; 64-bit Intel Xeon 3 GHz; 73 GB SCSI disks (3); 2 GB RAM
IBM 4560-SLX tape library attached (will have 2 SDLT-II drives attached when they arrive, probably next week); SDLT-II does 300 GB native, 600 GB compressed
Running SL 3.0.5, 64-bit

lfc: LFC Server (LFC = LCG File Catalog)
info page: https://uimon.cern.ch/twiki/bin/view/LCG/LfcAdminGuide
MySQL database used; 64-bit Intel Xeon 3 GHz; 160 GB SATA disks (2), software RAID-1; 2 GB RAM
Running SL 3.0.5, 64-bit

vobox: VO Box (Virtual Organization Box)
info page: http://agenda.nikhef.nl/askArchive.php?base=agenda&categ=a0613&id=a0613s3t1/transparencies
64-bit Intel Xeon 3 GHz; 160 GB SATA disks (2), software RAID-1; 2 GB RAM
Running SL 3.0.5, 64-bit

sc1-sc3: dCache Storage Elements
64-bit Intel Xeons 3 GHz; 3ware RAID controller, 8 × 232 GB disks in H/W RAID-5 giving 1.8 TB storage; 2 GB RAM
Running SL 4.1, 64-bit

sc4: SRM endpoint, dCache admin node and Storage Element
64-bit Opteron 246; 3ware RAID controller, 2 × 232 GB disks in H/W RAID-1 giving 250 GB storage; 2 GB RAM
IBM 4560-SLX tape library attached; we are moving both SDLT-I drives to this unit. SDLT-I does 160 GB native, 300 GB compressed
Running SL 4.1, 64-bit
Dario Barberis: ATLAS SC4 Plans
WLCG SC4 Workshop - Mumbai, 12 February 2006
https://twiki.cern.ch/twiki/pub/LCG/TalksAndDocuments/sc4-expt-plans.ppt
ATLAS SC4 Tests
• Complete Tier-0 test: internal data transfer from “Event Filter” farm to Castor disk pool, Castor tape, CPU farm
• Calibration loop and handling of conditions data, including distribution of conditions data to Tier-1s (and Tier-2s)
• Transfer of RAW, ESD, AOD and TAG data to Tier-1s
• Transfer of AOD and TAG data to Tier-2s
• Data and dataset registration in DB (add meta-data information to meta-data DB)
• Distributed production: full simulation chain run at Tier-2s (and Tier-1s); data distribution to Tier-1s, other Tier-2s and CAF
• Reprocessing raw data at Tier-1s; data distribution to other Tier-1s, Tier-2s and CAF
• Distributed analysis: “random” job submission accessing data at Tier-1s (some) and Tier-2s (mostly); tests of performance of job submission, distribution and output retrieval
ATLAS SC4 Plans (1): Tier-0 data flow tests
Phase 0: 3-4 weeks in March-April for internal Tier-0 tests
• Explore limitations of current setup
• Run real algorithmic code
• Establish infrastructure for calib/align loop and conditions DB access
• Study models for event streaming and file merging
• Get input from SFO simulator placed at Point 1 (ATLAS pit)
• Implement system monitoring infrastructure
Phase 1: last 3 weeks of June, with data distribution to Tier-1s
• Run integrated data flow tests using the SC4 infrastructure for data distribution
• Send AODs to (at least) a few Tier-2s
• Automatic operation for O(1 week)
• First version of shifter’s interface tools
• Treatment of error conditions
Phase 2: 3-4 weeks in September-October
• Extend data distribution to all (most) Tier-2s
• Use 3D tools to distribute calibration data
• No external data transfer during this phase (?)
The ATLAS TDAQ Large Scale Test in October-November prevents further Tier-0 tests in 2006… but is not incompatible with other distributed operations
ATLAS SC4 Plans (2)
• ATLAS CSC includes continuous distributed simulation productions: we will continue running distributed simulation productions all the time
Using all Grid computing resources we have available for ATLAS
The aim is to produce ~2M fully simulated (and reconstructed) events/week from April onwards, both for physics users and to build the datasets for later tests
We can currently manage ~1M events/week; ramping up gradually
• SC4 distributed reprocessing tests: test of the computing model using the SC4 data management infrastructure
Needs file transfer capabilities between Tier-1s and back to CERN CAF
Also distribution of conditions data to Tier-1s (3D)
Storage management is also an issue
Could use 3 weeks in July and 3 weeks in October
• SC4 distributed simulation intensive tests: once reprocessing tests are OK, we can use the same infrastructure to implement our computing model for simulation productions, as they would use the same setup both from our ProdSys and the SC4 side
First separately, then concurrently
ATLAS SC4 Plans (3)
Distributed analysis tests:
• “Random” job submission accessing data at Tier-1s (some) and Tier-2s (mostly)
• Generate groups of jobs and simulate analysis job submission by users at home sites
• Direct jobs needing only AODs as input to Tier-2s
• Direct jobs needing ESDs or RAW as input to Tier-1s
• Make preferential use of ESD and RAW samples available on disk at Tier-2s
• Tests of performance of job submission, distribution and output retrieval
• Test job priority and site policy schemes for many user groups and roles
• Distributed data and dataset discovery and access through metadata, tags, data catalogues
• Need same SC4 infrastructure as needed by distributed productions
• Storage of job outputs for private or group-level analysis may be an issue
• Tests can be run during Q3-4 2006: first a couple of weeks in July-August (after distributed production tests), then another longer period of 3-4 weeks in November