Katy Ellis
28th August 2019
GridPP43
CMS usage of UK resources
Introduction
• Katy Ellis
• New(ish) CMS / RAL Tier 1 Liaison
• Started in September 2018
• New to CMS (PhD on ATLAS, 2012)
• Included in my role – operations, improving job efficiency/failure rate, “projects”:
  • DOMA TPC, Rucio integration for CMS with CTA, XRootD investigations.
Contents
• CMS experiment upgrades in LS2
• Review of CMS UK computing resources
• News from Tier 2s
• Additional CMS resources
• Progress with Rucio for CMS
LHC schedule
CMS current status
• Preparing for Run 3
  • Hardware upgrades
  • Computing upgrades
• Preparing for the longer-term future
  • Long Shutdown 3 activities are already being planned in detail.
• HL-LHC civil-engineering work has been ongoing since June 2018 – five new buildings on the surface, as well as modifications to the underground cavern and galleries.
Long shutdown 2 CMS upgrades
• Installation of new beampipe
• Replacement pixel detector (innermost layer)
• Upgraded power system for the magnet
• Installation of new multi-GEM chambers for increased coverage of muon detection
CMS UK Computing Resources

Site                        CPU Pledge (HS06)  CPU Provision (HS06)  Disk Pledge  Disk Provision
RAL Tier 1                  52,000             61,408                5.44 PB*     5.44 PB*
Brunel (London)                                                                   1.49 PB
Imperial College (London)                      44,198                             4.50 PB
RALPP (Tier 2, South)                          24,122                             3.73 PB
Bristol (South)                                                                   727 TB
QMUL                        -                  14,934                N/A          2 TB
RHUL                        -                  Opportunistic         -            0
Oxford                      -                  Opportunistic         -            41 TB
Glasgow                     -                  Opportunistic         -            174 TB
DODAS                       N/A                Opportunistic         -            N/A
CMS@home                    N/A                Volunteered           -            N/A

* Includes 200 TB tape buffer
Source: REBUS

Total T2 CPU pledge: 50 kHS06 (London 32,641 HS06 + South 17,326 HS06)
T2 disk pledge: London 2.925 PB; South 975 TB
+ 17.6 PB tape pledge at T1
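A quick sanity check on the quoted totals – a minimal sketch in Python, using only the pledge numbers above:

```python
# Sanity check on the quoted T2 pledge totals (numbers from the table above).
london_cpu, south_cpu = 32_641, 17_326        # HS06
london_disk_pb, south_disk_tb = 2.925, 975    # PB and TB respectively

total_cpu = london_cpu + south_cpu                      # 49,967 HS06, i.e. ~50 kHS06
total_disk_pb = london_disk_pb + south_disk_tb / 1000   # ~3.9 PB

print(f"T2 CPU pledge:  {total_cpu:,} HS06 (~50 kHS06)")
print(f"T2 disk pledge: {total_disk_pb:.1f} PB")
```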
CPU in last 6 months
[Plots: total completed jobs and running cores over the last 6 months, from EGI Accounting, broken down into T1, T2s and opportunistic. CMS pledge lines are marked at 52 kHS06 (T1) and 48 kHS06; total T2 pledge = 50 kHS06. Caption: “The Katy Effect?”]
Imperial College news
• IC have moved their data centre from Kensington to Slough!
• Coordinated by Simon Fayer
• The move took ~1 week
• The data centre is now run remotely, but tended by local system administrators – no noticeable difference
RALPP news
• In the last month, RALPP has joined the LHCONE network
  • LHC Open Network Environment
  • Improves data access by flattening the T1/T2/T3 hierarchy so that any site may connect with any other.
• One issue connecting with the FNAL FTS, but it was quickly resolved.
• This is a precursor to RAL T1 joining LHCONE – other T1s are already connected.
Incorporating additional sites
• Analysis jobs submitted to RAL T1/T2, IC or Brunel will also match Glasgow, Oxford, QMUL and RHUL (see the configuration sketch below).
• UK sites now form a mesh structure
  • “CMS Tier 3” sites read/write data from/to RALPP
• CMS are keen to extend this to other non-CMS sites.
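To make the site matching concrete, here is a minimal CRAB3 configuration sketch. All task-specific names (request name, pset, dataset) are hypothetical placeholders, and the explicit whitelist is purely for illustration – in practice the global pool may match these sites automatically; the CMS site names themselves are the standard ones.

```python
# Minimal CRAB3 config sketch (illustrative, not the official UK-mesh setup).
from CRABClient.UserUtilities import config

config = config()
config.General.requestName = 'uk_mesh_test'            # hypothetical task name
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset.py'                    # hypothetical CMSSW config
config.Data.inputDataset = '/SomeDataset/Run2018A/NANOAOD'  # hypothetical dataset
config.Data.splitting = 'FileBased'
config.Data.unitsPerJob = 10

# Jobs may run at the UK CMS T2s *and* the associated "Tier 3" sites;
# the whitelist below illustrates the extended UK site list.
config.Site.whitelist = ['T1_UK_RAL', 'T2_UK_London_IC', 'T2_UK_London_Brunel',
                         'T2_UK_SGrid_RALPP', 'T3_UK_ScotGrid_GLA',
                         'T3_UK_SGrid_Oxford', 'T3_UK_London_QMUL',
                         'T3_UK_London_RHUL']
config.Site.storageSite = 'T2_UK_SGrid_RALPP'          # Tier 3 output goes to RALPP
```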
DODAS – “Dynamic On Demand Analysis Service”
• Start a personal Tier 2 (3?) on the cloud.
• Running in ~20 minutes on OpenStack, Azure, AWS, EGI clouds.
• Developed in Italy; adapted and tested by Riccardo di Maria at IC.
  • Looking for a new person.
  • Tested on a temporary cloud of 800 nodes at IC.
• “Useful if you have a deadline and a credit card”.
CMS@home *
• Volunteers run CMS jobs on their personal computers
• They can view MONIT plots with CERN or affiliate credentials, or via Facebook/Google/etc.
• Almost entirely single-core MC production jobs
• Planning to make submission more automated
• Trying to move to SLC7
* Talk to Ivan Reid if you want to know more
Move to Rucio
• Rucio will replace PhEDEx from Run 3
  • File transfer service
  • File catalogue
  • Highly scalable
  • Heterogeneous storage systems worldwide
  • Run centrally
• Used successfully by ATLAS for several years
• Now open to the wider community
• CMS activities include: integration with Production and User Analysis job submission, setup of databases (e.g. user accounts), data-synchronization performance testing, setup of monitoring, etc. (see the client sketch below)
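As a flavour of what “file transfer service + file catalogue” means in practice, here is a minimal sketch using the Rucio Python client to request a replica of a dataset at a site. The scope, dataset name and target RSE are hypothetical placeholders, and the snippet assumes an already-configured Rucio client environment.

```python
# Minimal sketch of requesting data placement with the Rucio client.
# Scope, dataset name and RSE below are hypothetical placeholders,
# and a configured rucio.cfg + valid credentials are assumed.
from rucio.client import Client

client = Client()

# Ask Rucio to maintain one replica of the dataset at the named RSE;
# Rucio itself chooses sources and drives the underlying FTS transfers.
rule_ids = client.add_replication_rule(
    dids=[{'scope': 'cms', 'name': '/store/example/NanoAODv5/dataset'}],
    copies=1,
    rse_expression='T2_UK_London_IC',
)
print(rule_ids)  # rule identifier(s), usable later to track the transfer
```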
Rucio for CMS – current status
• Rucio components are automated in Kubernetes.
  • Monitoring will be added in Kibana.
• ‘Million file test’ progressing well – being repeated.
  • Monitoring in early stages…
• Some level of synching on many sites - NanoAOD (spot-check sketch below).
  • Subscriptions.
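A hedged sketch of how one might spot-check that a synchronized NanoAOD file is visible through Rucio and where its replicas live; the DID below is a hypothetical placeholder.

```python
# Sketch: check that a synchronized file is known to Rucio and list its
# replicas. The scope/name are hypothetical placeholders.
from rucio.client import Client

client = Client()
dids = [{'scope': 'cms', 'name': '/store/example/nanoaod/file.root'}]
for rep in client.list_replicas(dids=dids):
    # 'rses' maps each RSE holding a replica to its physical file URLs
    for rse, pfns in rep['rses'].items():
        print(rse, pfns)
```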
Rucio for CMS – Tape
• Able to transfer into all Tier 1 tape systems
  • RAL was the most difficult – different Rucio config
• Have now fixed the RAL config, and are able to use the tape as a source
• Cannot yet use other T1 tapes as sources
  • Hoping the fix for RAL will point towards a solution
• Able to transfer into and out of the CERN Tape Archive (CTA)
  • Must be done via EOS
• Started more substantial tests, ~10 TB
• Waiting for monitoring from CTA (2.155 TB were written in 3 hours)
• Working towards a ~200 TB transfer test (see the back-of-the-envelope estimate below)
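For scale, the quoted 2.155 TB in 3 hours corresponds to roughly 200 MB/s sustained. The sketch below works out what that rate would imply for the ~200 TB test if it held unchanged, which is a deliberately naive assumption (a real test would add parallelism).

```python
# Back-of-the-envelope: sustained rate from the 2.155 TB / 3 h data point,
# and naive extrapolation to a 200 TB test at the same rate.
written_tb = 2.155
hours = 3.0

rate_tb_per_h = written_tb / hours                    # ~0.72 TB/h
rate_mb_per_s = written_tb * 1e6 / (hours * 3600)     # ~200 MB/s

target_tb = 200.0
days_at_same_rate = target_tb / rate_tb_per_h / 24    # ~11.6 days

print(f"rate: {rate_tb_per_h:.2f} TB/h ({rate_mb_per_s:.0f} MB/s)")
print(f"200 TB at this rate: ~{days_at_same_rate:.1f} days")
```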
Summary
• CMS is preparing for Run 3 and beyond.
• UK sites are meeting their pledges, and often exceeding them.
• CMS Tier 2s are working well.
• Thanks to other sites for offering spare capacity.
• Tier 1 continues to make improvements.
• Rucio integration and testing for CMS is in full swing.
Backup
VO shares for UK sites
Increase in data rate
Long shutdown activities
• Detector upgrades
• Production
• Tape and disk cleaning
• CERN Tape Archive
• Rucio
CMS detector upgrades for Run 3
Further upgrades for HL-LHC in Runs 4 and 5 are already at the detailed planning stage via TDRs.
Rucio and CERN Tape Archive (CTA)
• CTA: CERN Tape Archive, which replaces CASTOR at CERN this summer
  • Meta-data migration only
  • Change to the high-level structure? Possible issue with Rucio
• RAL will also be changing tape system in the medium term
  • Tender is out
• Useful to gain expertise integrating Rucio with tape systems
• Pre-production service on CTA, with a test Rucio Storage Element (RSE) – see the sketch below
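A minimal sketch of how a test tape RSE might be declared through the Rucio client. The RSE name is a hypothetical placeholder, and a real setup would also need protocols, FTS settings and quotas configured; this shows only the declaration step.

```python
# Sketch: declare a test tape RSE and flag it as tape via an RSE attribute.
# The RSE name is a hypothetical placeholder; a real RSE also needs
# protocols, FTS configuration and space limits before it can be used.
from rucio.client import Client

client = Client()
client.add_rse('T1_UK_RAL_TapeTest', deterministic=True)
client.add_rse_attribute('T1_UK_RAL_TapeTest', 'istape', True)
```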
Details on CMS T2s (pledge)
• Imperial (2200 TB)
  • Moving the data centre to Slough in June
  • 2 × 100 Gb/s network links (one is fallback)
• RAL_PP (1100 TB, 1600 TB imminently)
  • Connecting to LHCONE soon
• Brunel (500 TB)
  • CMS using close to 100% of storage due to a bug
  • Some issues after upgrading to the DOME version of DPM; better testing before deployment would improve this situation.
  • 40 Gb/s coming in May
‘Old’ CERN CASTOR tape setup
‘New’ CERN CTA tape setup
Why is it so complicated?