SouthGrid Technical Meeting, Pete Gronbech: 26th August 2005, Oxford.
Present
• Pete Gronbech – Oxford
• Ian Stokes-Rees – Oxford
• Chris Brew – RAL PPD
• Santanu Das – Cambridge
• Yves Coppens – Birmingham
Agenda
• Chat
• 10:30 Coffee
• Pete + Others
• 1pm Lunch
• Interactive Workshop!!
• 3:15pm Coffee
• Finish
Southgrid Member Institutions
• Oxford
• RAL PPD
• Cambridge
• Birmingham
• Bristol
• HP-Bristol
• Warwick
Stability, Throughput and Involvement
• The last quarter has been a good, stable period for SouthGrid
• Addition of Bristol PP
• 4 out of 5 sites already upgraded to 2_6_0
• Large involvement in the Biomed DC
Monitoring
• http://www.gridpp.ac.uk/ganglia/
• http://map.gridpp.ac.uk/
• http://lcg-testzone-reports.web.cern.ch/lcg-testzone-reports/cgi-bin/lastreport.cgi – configure the view for UKI
• http://www.physics.ox.ac.uk/users/gronbech/gridmon.htm
• Dave Kant's helpful doc in the minutes of a TB-SUPPORT meeting links to http://goc.grid-support.ac.uk/gridsite/accounting/tree/gridpp_view.php
Status at RAL PPD
• SL3 cluster on LCG 2.6.0
• CPUs: 11 × 2.4 GHz, 33 × 2.8 GHz
– 100% dedicated to LCG
• 0.7 TB storage
– 100% dedicated to LCG
• Configured 6.4 TB of IDE RAID disks for use by dCache
• 5 systems to be used for the preproduction testbed
RAL 2
• dCache installation
• Pre-Production?
• Upgrade to 2_6_0 report
– The R-GMA MON node in early yaim versions did not work for upgrades (only fresh installs). Problems with the Tomcat connector order (the OpenSSL connector before the insecure connector; if no certificate was present this had to be fixed). The latest release of yaim plus the certificate is OK.
– yum package name changes caused some problems; the old perl API rpm needs deleting (a hedged clean-up sketch follows).
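The perl API clean-up amounts to finding and removing the rpm left behind by the rename. A minimal sketch of that step, assuming the stale package can be matched by a name pattern (the pattern below is hypothetical; take the real package name from the upgrade notes):

```python
import re
import subprocess

STALE_PATTERN = r"perl-.*-api"  # hypothetical pattern; substitute the real package name

def installed_packages():
    """Return the names of all installed rpm packages."""
    out = subprocess.run(["rpm", "-qa"], capture_output=True, text=True, check=True)
    return out.stdout.split()

def remove_stale(pattern):
    """Print (and optionally erase) packages matching the stale-name pattern."""
    for pkg in installed_packages():
        if re.search(pattern, pkg):
            print("stale package:", pkg)
            # subprocess.run(["rpm", "-e", pkg], check=True)  # uncomment to actually remove

if __name__ == "__main__":
    remove_stale(STALE_PATTERN)
```

Leaving the rpm -e line commented out gives a dry run of what would be removed before touching anything.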
Status at Cambridge
• Currently LCG 2.6.0 on SL3
• CPUs: 42 × 2.8 GHz (of the extra nodes, only 2 out of 10 are any good)
– 100% dedicated to LCG
• 2 TB storage (3 TB exist but only 2 TB are available)
– 100% dedicated to LCG
• Condor batch system
• Lack of Condor support from the LCG teams
Cambridge 2
• CamGrid – LCG interaction
– All nodes would need the LCG WN software in order to make them available everywhere
– The GridPP CE hosts the central manager; nodes have two IPs, one Cambridge private and one LCG public IP
– condor user1 and condor user2
– ATLAS jobs not working: the software is installed but not verified, due to the above problems
• Condor issues
• Monitoring / Accounting
– Ganglia installed and nearly ready; need to inform A. McNab
• Upgrade (2_6_0) report
– Same rpm problems
– R-GMA fixes from Yves
– Tomcat
– Overall quite easy compared with previous releases
Status at Bristol
• Status
– Yves and Pete installed SL304 and LCG-2_4_0 and the site went live on July 5th 2005. Yves upgraded to 2_6_0 in the last week of July as part of pre-release testing.
• Existing resources
– 80-CPU BaBar farm moved to Birmingham
– GridPP nodes plus local cluster nodes used to bring the site on line. The local cluster still needs to be integrated.
• New resources
– Funding now confirmed for a large University investment in hardware
– Includes CPU, high-quality and scratch disk resources
• Humans
– New system manager post (RG) should be in place
– New SouthGrid support / development post (GridPP / HP) being filled
– HP have moved the IA-64 machines on to cancer research due to lack of use by LCG
Status at Birmingham
• Currently SL3 with LCG-2_6_0
• CPUs: 24 × 2.0 GHz Xeon (+48 local nodes which could in principle be used but…)
– 100% LCG
• 1.8 TB Classic SE
– 100% LCG
• BaBar farm moving to SL3 and the Bristol machines integrated, but not yet on LCG
Birmingham 2
• BaBar cluster expansion
• LCG-2_6_0 early testing in July
• Involvement in the Pre-Production Grid
• Installation of DPM?
– How to migrate the data, or just close the old SE? (see the hedged sketch after this list)
• Integration of local users vs grid users
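One possible route for the DPM migration question above, sketched with the lcg-utils command-line tools: replicate each file registered on the old classic SE to the new DPM, verify the copies, and only then retire the old replicas. The SE hostname, VO and GUID list below are assumptions, and the exact lcg-* options should be checked against the installed lcg_util version before use.

```python
import subprocess

VO = "dteam"                      # example VO
NEW_SE = "dpm-se.example.ac.uk"   # hypothetical DPM head node

def replicate(guid):
    """Copy the file to the new DPM and register the new replica in the catalogue."""
    subprocess.run(["lcg-rep", "--vo", VO, "-d", NEW_SE, guid], check=True)

def migrate(guid_file="guids_on_old_se.txt"):
    # One GUID per line; replicas on the old classic SE would only be removed
    # (e.g. with lcg-del) after the new copies have been verified.
    with open(guid_file) as f:
        for line in f:
            guid = line.strip()
            if guid:
                replicate(guid)

if __name__ == "__main__":
    migrate()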
Status at Oxford
• Currently LCG 2.4.0 on SL304
• All 74 CPUs running since ~June 20th
• CPUs: 80 × 2.8 GHz
– 100% LCG
• 1.5 TB storage; the second 1.5 TB will be brought on line as DPM or dCache
– 100% LCG
• Some further air conditioning problems, now resolved for Room 650; the second rack is in the overheating basement
• Heavy use by Biomed during their DC
• Plan to give local users access
Oxford 2
• Need to upgrade to 2_6_0 next week
• Early testing of 2_6_0 in July on tbce01
• Integration with the PP cluster to give local access to grid queues
Security
• Best practices link: https://www.gridpp.ac.uk/deployment/security/index.html
• Wiki entry: http://goc.grid.sinica.edu.tw/gocwiki/AdministrationFaq
• iptables?? – Birmingham to share their setup on the SouthGrid web pages. Completed. (A hedged firewall sketch follows below.)
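As a starting point for the iptables discussion, a minimal sketch of generating rules for an LCG service node. The port list is an assumption based on commonly quoted Globus/LCG defaults and must be checked against the LCG port-usage documentation and the site's own services before use.

```python
# Port numbers are assumptions (Globus gatekeeper, GridFTP control, a typical
# GLOBUS_TCP_PORT_RANGE); verify them against the LCG port documentation.
SERVICE_PORTS = {
    "globus-gatekeeper": "2119",
    "gridftp-control": "2811",
    "globus-tcp-port-range": "20000:25000",
}

def rules(ports):
    """Yield one ACCEPT rule per service port, then a catch-all REJECT."""
    for name, port in ports.items():
        yield f"-A INPUT -p tcp --dport {port} -j ACCEPT   # {name}"
    yield "-A INPUT -p tcp --syn -j REJECT   # everything else"

if __name__ == "__main__":
    for rule in rules(SERVICE_PORTS):
        print("iptables " + rule)
```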
Action Plan for Bristol
• Plan to visit on June 9th to install an installation server (a hedged PXE/DHCP sketch follows this list)
– DHCP server
– NFS copies of SL (local mirror)
– PXE boot setup etc.
• Second visit to reinstall the head nodes with SL304 and LCG-2_4_0, and some worker nodes
• BaBar cluster to go to Birmingham
– Fergus, Chris, Yves to liaise
• Completed
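For the installation server, a minimal sketch of the DHCP side of the PXE setup, assuming ISC dhcpd and a pxelinux boot loader served over TFTP. All addresses are hypothetical placeholders for the site's own subnet.

```python
# Writes a minimal dhcpd.conf fragment for PXE-booting install clients.
# Every address here is a made-up placeholder; use the real site values.
DHCPD_CONF = """\
subnet 10.0.0.0 netmask 255.255.255.0 {{
    range {first} {last};        # pool for nodes being reinstalled
    option routers 10.0.0.1;
    next-server {tftp};          # TFTP server holding the boot loader
    filename "pxelinux.0";       # PXE boot loader
}}
"""

def write_conf(path="dhcpd.conf.example",
               first="10.0.0.100", last="10.0.0.200", tftp="10.0.0.2"):
    """Write the dhcpd.conf fragment with the given pool and TFTP server."""
    with open(path, "w") as f:
        f.write(DHCPD_CONF.format(first=first, last=last, tftp=tftp))

if __name__ == "__main__":
    write_conf()
```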
Action Plan for SouthGrid
• Ensure all sites are up to date for GridPP14 (Oxford)
• SRM installations
• SC4 preparations
• LHC DC awareness
Grid site wiki
• http://www.gridsite.org/wiki/main_page
• http://www.physics.gla.ac.uk/gridpp/datamanagement
LCG Deployment Schedule
[Timeline figure: SC2 → SC3 → SC4 → LHC Service Operation across 2005–2008, each with preparation, setup and service phases; cosmics, first beams, first physics and the full physics run follow]
• Apr05 – SC2 Complete
• Jun05 – Technical Design Report
• Jul05 – SC3 Throughput Test
• Sep05 – SC3 Service Phase
• Dec05 – Tier-1 Network operational
• Apr06 – SC4 Throughput Test
• May06 – SC4 Service Phase starts
• Sep06 – Initial LHC Service in stable operation
• Apr07 – LHC Service commissioned
Overall Schedule (Raw-ish)
[Chart: ALICE, ATLAS, CMS and LHCb challenge slots across September–December]
Service Challenge 4 – SC4
• SC4 starts April 2006
• SC4 ends with the deployment of the FULL PRODUCTION SERVICE
– Deadline for component (production) delivery: end January 2006
• Adds further complexity over SC3
– Additional components and services
– Analysis use cases
– SRM 2.1 features required by the LHC experiments
– All Tier-2s (and Tier-1s…) at full service level
– Anything that dropped off the list for SC3…
– Services oriented at analysis and end users
– What are the implications for the sites?
• Analysis farms:
– Batch-like analysis at some sites (no major impact on sites)
– Large-scale parallel interactive analysis farms at major sites
– (100 PCs + 10 TB storage) × N
• User community:
– No longer a small (<5) team of production users
– 20–30 work groups of 15–25 people
– Large numbers (100s–1000s) of users worldwide
SC4 Timeline
• September 2005: first SC4 workshop(?) – 3rd week September proposed
• January 31st 2006: basic components delivered and in place
• February / March: integration testing
• February: SC4 planning workshop at CHEP (w/e before)
• March 31st 2006: integration testing successfully completed
• April 2006: throughput tests
• May 1st 2006: Service Phase starts (note compressed schedule!)
• September 1st 2006: Initial LHC Service in stable operation
• Summer 2007: first LHC event data