
Transcript of "WLCG Service Schedule / LHC schedule: what does it imply for SRM deployment?", WLCG Storage Workshop, CERN, July 2007

WLCG Service Schedule
LHC schedule: what does it imply for SRM deployment?
WLCG Storage Workshop, CERN, July 2007

Agenda
- The machine
- The experiments
- The service

LHC schedule (LHC commissioning - CMS, June 2007)
[Slide: month-by-month Gantt chart spanning 2007-2008, with activity bars for: interconnection of the continuous cryostat; leak tests of the last sub-sectors; inner triplet repairs & interconnections; global pressure test & consolidation; flushing; cool-down; powering tests; operational testing of available sectors; machine checkout; beam commissioning to 7 TeV; warm-up; consolidation.]
[Two further slides: LHC accelerator schedule charts.]

Machine summary
- No engineering run in 2007.
- Startup in May 2008; we aim to be seeing high-energy collisions by the summer.
- No long shutdown at the end of 2008.
- See also the DG's talk.

Experiments
- Continue preparations for the Full Dress Rehearsals.
- The schedule from CMS is very clear: CSA07 runs from September 10 for 30 days; ready for a cosmics run in November; another such run in March.
- ALICE have stated an FDR from November.
- Expecting concurrent exports from ATLAS and CMS from the end of July: 1 GB/s from ATLAS, 300 MB/s from CMS.
- Bottom line: continuous activity; the period after CHEP is likely to be (very) busy.

ATLAS event sizes (Software & Computing Workshop, CERN, June 26, 2007)
- We already needed more hardware in the T0 because: the TDR did not include a full ESD copy to BNL; transfers require more disk servers than expected; there is 10% less disk space in the CAF.
- From the TDR: RAW = 1.6 MB, ESD = 0.5 MB, AOD = 0.1 MB; a 5-day buffer at CERN is 127 TB; currently 50 disk servers, 300 TB.
- For Release 13: RAW = 1.6 MB, ESD = 1.5 MB, AOD = 0.23 MB (including trigger & truth).
- 2.2 → 3.3 MB per event = 50% more at the T0.
- With 3 ESD and 10 AOD copies: 4.1 → 8.4 MB per event = a factor of 2 more for exports (the arithmetic is checked in the sketch after this list).
- More disk servers are needed for T0-internal traffic and for exports; 40% less disk in the CAF; extra tapes and drives mean a 25% cost increase, which has to be taken away from the CAF again.
- There are also implications for Tier-1/2 sites: they can store 50% less data.
- Goal: run this summer for 2 weeks uninterrupted at nominal rates with all Tier-1 sites.
- Event sizes from the cosmic run are ~8 MB (no zero suppression).
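The per-event figures above can be checked directly. A minimal sketch, assuming the export stream carries one RAW copy plus the replicated ESD/AOD copies (an assumption, but one that reproduces the slide's 4.1 and 8.4 MB numbers):

```python
# Per-event sizes in MB, taken from the slide above.
tdr = {"RAW": 1.6, "ESD": 0.5, "AOD": 0.1}
rel13 = {"RAW": 1.6, "ESD": 1.5, "AOD": 0.23}

def t0_size(sizes):
    """One copy of each format kept at the Tier-0."""
    return sum(sizes.values())

def export_size(sizes, esd_copies=3, aod_copies=10):
    """Exported volume per event: RAW once plus replicated ESD/AOD (assumed)."""
    return sizes["RAW"] + esd_copies * sizes["ESD"] + aod_copies * sizes["AOD"]

print(t0_size(tdr), t0_size(rel13))          # 2.2 -> 3.33 MB: ~50% more at the T0
print(export_size(tdr), export_size(rel13))  # 4.1 -> 8.4 MB: a factor ~2 for exports
```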
ATLAS T0 → T1 exports: situation as of May 28 (Software & Computing Workshop, CERN, June 26, 2007)
[Table: per-site efficiency (%), average throughput (MB/s) and nominal rate (MB/s) for ASGC, BNL, CNAF, FZK, Lyon, NDGF, PIC, RAL, SARA and TRIUMF, with milestone markers at 50%, 100%, 150% and 200% of nominal achieved.]

Services schedule
- Q: What do you (CMS) need for CSA07? A: Nothing; we would like FTS 2.0 at the Tier-1s (and not too late), but it is not required for CSA07 to succeed. We are trying to ensure that this is done at the CMS Tier-1s.
- The other major residual service is SRM v2.2. Windows of opportunity: post-CSA07, early 2008.
- Q: How long will SRM 1.1 services be needed? 1 week? 1 month? 1 year?
- The LHC annual schedule has a significant impact on larger service upgrades / migrations (cf. the COMPASS triple migration).

S.W.O.T. analysis of WLCG services
- Strengths: we do have a service that is used, albeit with a small number of well-known and documented deficiencies (with work-arounds).
- Weaknesses: continued service instabilities; holes in operational tools and procedures; ramp-up will take at least several (many?) months more.
- Threats: hints of possible delays could re-ignite discussions on new features.
- Opportunities: maximise the time remaining until high-energy running to 1) ensure all remaining residual services are deployed as rapidly as possible, but only when sufficiently tested and robust, and 2) focus on smooth service delivery, with emphasis on improving all operation, service and support activities.
- All services (including the residual ones) should be in place no later than Q1 2008, by which time a marked improvement in the measurable service level should also be achievable.

LCG ramp-up
- A steep ramp-up (4x to 6x) is still needed before the first physics run.
- [Chart: evolution of installed capacity from April 2006 to June 2007, against the target capacity from MoU pledges for 2007 (due July 2007) and 2008 (due April 2008).]

WLCG service: S / M / L vision
- Short-term: ready for the Full Dress Rehearsals, now expected to fully ramp up around mid-September (after CHEP). The only thing realistic on this time-frame is FTS 2.0 services at the WLCG Tier-0 and Tier-1s. Schedule: June 18th at CERN; available mid-July for the Tier-1s.
- Medium-term: what is needed and possible for 2008 LHC data taking and processing. The remaining residual services must be in full production mode early in Q1 2008 at all WLCG sites! Significant improvements in monitoring, reporting and logging, leading to more timely error response and service improvements.
- Long-term: anything else; the famous "sustainable e-Infrastructure"?

WLCG service deployment: lessons learnt

Types of intervention
0. (Transparent) load-balanced servers / services.
1. Infrastructure: power, cooling, network.
2. Storage services: CASTOR, dCache.
3. Interaction with a backend DB: LFC, FTS, VOMS, SAM, etc.

Transparent interventions: definition (EGI Preparation Meeting, Munich, March)
- We have reached agreement with the LCG VOs that the combination of hardware / middleware / experiment-ware should be resilient to service glitches.
- A glitch is defined as a short interruption of (one component of) the service that can be hidden, at least to batch work, behind some retry mechanism(s).
- How long is a glitch? All central CERN services are covered for power glitches of up to 10 minutes; some are also covered for longer by diesel UPS, but any non-trivial service seen by the users is only covered for 10 minutes.
- Can we implement the services so that ~all interventions are transparent? YES, with some provisos (a sketch of the retry idea follows this list).
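As an illustration of the retry mechanism that this definition relies on, a minimal sketch in Python; the operation, retry count and delay are illustrative choices, not taken from the talk:

```python
import time

def with_retries(operation, attempts=10, delay_seconds=60):
    """Hide a short service glitch behind retries: keep re-trying the
    operation, sleeping between attempts, so any interruption shorter than
    roughly attempts * delay_seconds is invisible to the caller (e.g. a
    batch job)."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:  # e.g. a storage service briefly unreachable
            if attempt == attempts:
                raise  # the glitch lasted too long: surface the failure
            time.sleep(delay_seconds)

# Hypothetical usage: a batch job's file copy survives a ~10-minute glitch.
# copy_file is an illustrative name, not a real client call.
# with_retries(lambda: copy_file("srm://example/in.root", "/tmp/in.root"))
```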
Targetted interventions
- Common interventions include: adding additional resources to an existing service; replacing hardware used by an existing service; operating system / middleware upgrades or patches; similar operations on the DB backend (where applicable).
- Pathological cases include: massive machine-room reconfigurations, as performed at CERN (and elsewhere) to prepare for LHC; wide-spread power or cooling problems; major network problems, such as DNS / router / switch problems.
- The pathological cases clearly need to be addressed too!

More transparent interventions
- "I am preparing to restart our SRM server here at IN2P3-CC, so I have closed the IN2P3 channel on prod-fts-ws in order to drain the current transfer queues. I will open them in 1 hour or 2."
- Is this a transparent intervention or an unscheduled one? A: technically unscheduled, since it is SRM downtime. (An EGEE broadcast was made, but this is just an example.)
- However, if the channel had first been paused, which would mean that no files fail, it would instead become transparent, at least to the FTS, which is explicitly listed as a separate service in the WLCG MoU, both for T0 and T1!
- i.e. if we can trivially limit the impact of an intervention, we should (cf. the WLCG MoU services at Tier0 / Tier1s / Tier2s). The pause-first sequence is sketched below.
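A sketch of the pause-first sequence described above, using a hypothetical channel-administration API: pause_channel, active_transfers and resume_channel are illustrative names, not the actual FTS admin interface.

```python
import time

def transparent_srm_restart(fts, channel, restart_srm):
    """Pause the FTS channel before an SRM restart so that queued transfers
    wait instead of failing; the intervention is then invisible to the FTS."""
    fts.pause_channel(channel)            # new work stops being scheduled
    while fts.active_transfers(channel):  # let in-flight transfers drain
        time.sleep(30)
    restart_srm()                         # the actual intervention
    fts.resume_channel(channel)           # queued transfers continue; none failed

# Hypothetical usage:
# transparent_srm_restart(fts_admin, "IN2P3-CERN", restart_local_srm)
```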
Service review
For each service we need the current status of:
- Power supply (redundant, including the power feed? Critical? Why?)
- Servers (single or multiple? DNS load-balanced? HA Linux? RAC? Other?)
- Network (are the servers connected to separate network switches?)
- Middleware (can the middleware transparently handle the loss of one or more servers?)
- Impact (what is the impact on other services and/or users of a loss or degradation of the service?)
- Quiesce / recovery (can the service be cleanly paused? Is there built-in recovery, e.g. of buffers? What length of interruption is tolerated?)
- Tested (have interventions been made transparently using the above features?)
- Documented (operations procedures, service information).

Why a grid solution? (The Worldwide LHC Computing Grid, CCP, Gyeongju, Republic of Korea)
The LCG Technical Design Report lists:
1. The significant costs of [providing,] maintaining and upgrading the necessary resources are more easily handled in a distributed environment, where individual institutes and organisations can fund local resources whilst contributing to the global goal.
2. No single points of failure: multiple copies of the data and automatic reassignment of tasks to resources facilitate access to the data for all scientists, independent of location, with round-the-clock monitoring and support.

Services: summary
- It's open season on SPOFs!
- "You are a SPOF! You are the enemy of the Grid! You will be exterminated! Seek! Locate! Exterminate!"

Summary
- 2008 / 2009 LHC running will be at lower than design luminosity (but the same data rate?).
- Work has (re-)started with CMS to jointly address critical services.
- Realistically, it will take quite some effort and time to get services up to "design luminosity".

Questions for this workshop
1. Given the schedule of the experiments and the LHC machine, (when) can we realistically deploy SRM 2.2 in production?
2. What is the roll-out schedule? (WLCG sites by name, and possibly by VO.)
3. How long is the validation period, including possible fixes to clients (FTS etc.)?
4. For how long do we need to continue to run SRM v1.1 services? Migration issues? Clients?

ATLAS visit
- For those who have registered, now is a good time to pay the 10 deposit.
- RDV 14:00 Geneva time, CERN reception, B33.

Backup slides

Service progress summary (updates presented at the June GDB)
- LFC: bulk queries deployed in February; secondary groups deployed in April. ATLAS and LHCb are currently giving new specifications for other bulk operations, scheduled for deployment this autumn, with matching GFAL and lcg-utils changes.
- DPM: SRM 2.2 support released in November; secondary groups deployed in April. Support for ACLs on disk pools has just passed certification. SL4 32- and 64-bit versions are certified, apart from the vdt (gridftp) dependencies.
- FTS 2.0: has been through integration and testing, including certificate delegation, SRM v2.2 support and service enhancements; now being validated in the PPS and the pilot service (already completed by ATLAS and LHCb); will then be used in CERN production for 1 month (from June 18th) before release to the Tier-1s. Ongoing (less critical) developments to improve monitoring piece by piece continue.
- 3D: all Tier-1 sites are in production mode and validated with respect to the ATLAS conditions-DB requirements. 3D monitoring is integrated into the GGUS problem-reporting system. Testing to confirm the Streams failover procedures runs in the next few weeks; we will then exercise coordinated DB recovery with all sites. Also starting Tier-1 scalability tests with many ATLAS and LHCb clients, so as to have the correct DB server resources in place by the autumn.
- VOMS roles: mapping to job-scheduling priorities has been implemented at the Tier-0 and most Tier-1s, but the behaviour is not as expected (ATLAS report that production-role jobs map to both production and normal queues), so this is being re-discussed.
- gLite 3.1 WMS: passed certification and is now in integration. It is being used for validation work at CERN by ATLAS and CMS, with LHCb to follow. Developers at CNAF fix any bugs, then run 2 weeks of local testing before giving patches back to CERN.
- gLite 3.1 CE: still under test, with no clear date for completion. The backup solution is to keep the existing 3.0 CE, which will require SLC3 systems. Alternative solutions are also being discussed.
- SL4: the SL3-built, SL4 compatibility-mode UI and WN have been released, with the decision to deploy left to the sites. The native SL4 32-bit WN is in the PPS now and the UI is ready to go in; they will not be released to production until experiment testing is completed. SL4 DPM (needs vdt) is important for sites that buy new hardware.
- SRM 2.2: CASTOR2 work is coupled to the ongoing performance enhancements; dCache 1.8 beta has test installations at FNAL, DESY, BNL, FZK, Edinburgh, IN2P3 and NDGF, most of which are also in the PPS.
- DAQ-Tier-0 integration: integration of ALICE with the Tier-0 has been tested at a throughput of 1 GB/s. LHCb testing is planned for June, then ATLAS and CMS from September.
- Operations: many improvements are under way to increase the reliability of all services. See this workshop and also the WLCG Collaboration. N.B. it's not all dials & dashboards!