Transcript of Introduction T2s - indico.in2p3.fr
CAF_T2_141116 1
Introduction T2s
L. Poggioli, LAL
• Recent
– S&C week @CERN, 26-30/09
– CHEP2016
– Pledges 2017 & 2018 revisited
• RRB feedback
• ATLAS policy
• Next
– Sites jamboree 18-20 January @ CERN: https://indico.cern.ch/event/579473/
Resource usage in 2016
Distributed processing: Data and MC
Heterogeneous resource
Resource usage
• T1s are full (88%)
– 5-10% cannot be used (tape buffers)
– Fraction of secondary
• T2s are FINALLY full (at 78%)
– Old recurrent problem
– Large fraction of secondary
WORLD Cloud (1)
See Barreiro et al., CHEP2016
• Fully activated end March 2016
• Going definitely away from MONARC model
• Dynamic, tasks not confined to a cloud. Group of processing sites defined dynamically per task
• Task nucleus
– Task brokerage chooses a nucleus for each task w.r.t. data locality, queued work & available storage
– T1s and the bigger T2s are defined as nuclei
– Output aggregated in task nucleus
WORLD Cloud (2)
• Task satellites
– Run jobs and ship the output to the nucleus
– Job brokerage selects satellites for each task, based on usual criteria (#jobs, data availability)
– Satellites are selected worldwide: a network weight matches well connected nuclei & satellites
• Nuclei: http://adc-ddm-mon.cern.ch/ddmusr01/NUCLEUS_DATADISK.html
– Currently T1s and ~20% of T2s. Better T2 disk usage!
– 65% of datadisk in nuclei, aim to increase to ~80%
– Today for FR: CC, Tokyo, LAPP
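The nucleus/satellite brokerage described above can be sketched as follows. This is a toy illustration, not PanDA code: the site names, free-storage and queue numbers, the network-weight table, and the scoring formula are all invented for the example; only the idea (pick a nucleus by data locality/storage/queue, then rank satellites worldwide by a network weight to that nucleus) comes from the slides.

```python
# Toy sketch of WORLD-cloud task brokerage (illustrative only, not PanDA code).
NUCLEI = {"CCIN2P3": {"free_storage_tb": 800, "queued_jobs": 1200},
          "Tokyo":   {"free_storage_tb": 500, "queued_jobs": 300}}

SATELLITES = {"LAPP": {"queued_jobs": 100},
              "LPC":  {"queued_jobs": 400}}

# Hypothetical network weight: higher means better nucleus<->satellite connectivity.
NET_WEIGHT = {("CCIN2P3", "LAPP"): 0.9, ("CCIN2P3", "LPC"): 0.6,
              ("Tokyo", "LAPP"): 0.3,   ("Tokyo", "LPC"): 0.2}

def choose_nucleus(task_input_tb):
    """Pick a nucleus with enough free storage and the shortest queue."""
    ok = {n: v for n, v in NUCLEI.items()
          if v["free_storage_tb"] >= task_input_tb}
    return min(ok, key=lambda n: ok[n]["queued_jobs"])

def choose_satellites(nucleus, k=1):
    """Rank satellites worldwide: good network weight to the nucleus,
    penalised by their current queue length (invented scoring)."""
    def score(s):
        return NET_WEIGHT[(nucleus, s)] / (1 + SATELLITES[s]["queued_jobs"])
    return sorted(SATELLITES, key=score, reverse=True)[:k]

nucleus = choose_nucleus(400)   # both sites fit 400 TB; Tokyo has the shorter queue
sats = choose_satellites(nucleus)
```

The point of the sketch is the decoupling: the nucleus is fixed first (output aggregation target), then satellites are drawn from anywhere in the world, so tasks are no longer confined to a MONARC-style cloud.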
Run-2: 2017 & 2018
• LHC delivered 50% more data in 2016
– Expect the same for 2017 & 2018
• -> New input parameters for CRSG
• -> New pledge requests (for the 4 experiments)
Requests for 2017 & 2018
• 50% more data but only 20% increase (disk & CPU)
• Majority of resources for MC production
– N_YEAR-N(FullSim) = 3.5B + 0.3B × N_YEAR-N(Data)
– N_YEAR-N(FastSim) = F × N_YEAR-N(FullSim), with F = 0.6 in 2017 and F = 0.7 in 2018
• Tape request reduced from experience with the lifetime model
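A quick arithmetic check of the event-count formulas above. The 8B data-event input is an assumed value chosen only to make the numbers concrete, not a figure from the slides; the coefficients (3.5B, 0.3, F) are the ones quoted above.

```python
def n_fullsim(n_data_billions):
    """FullSim need in billions of events, per the slide's formula:
    N_YEAR(FullSim) = 3.5B + 0.3B * N_YEAR(Data)."""
    return 3.5 + 0.3 * n_data_billions

def n_fastsim(n_fullsim_billions, year):
    """FastSim need, scaled from FullSim by F = 0.6 (2017) or 0.7 (2018)."""
    f = {2017: 0.6, 2018: 0.7}[year]
    return f * n_fullsim_billions

# Hypothetical input: 8B data events in a year.
full = n_fullsim(8.0)         # 3.5 + 0.3 * 8 = 5.9B FullSim events
fast = n_fastsim(full, 2017)  # 0.6 * 5.9 = 3.54B FastSim events
```

Note how the formula couples the simulation budget linearly to the data volume, which is why a 50% data increase with only a 20% resource increase squeezes MC production first.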
FLAT BUDGET model no longer valid
CRSG outcome
• For 2017: Endorse ATLAS requests!
– Crucial: requested resources to be available at T0. Highest priority: tape at T0 & T1s for data and MC
– Essential: large use of beyond-pledge CPU resources for full simulation
• This does not guarantee that Funding Agencies will be able to fulfil the requests (OK for major FAs, not for France today)
• For 2018
– Evaluate impact of parking data
– Evaluate impact of reducing #MC events
If +20% resource not available (1)
• In France (under discussion with LCG-FR)
– In the flat budget scenario, already unable to fulfil 2017 April (except for disk at T2s)
• ATLAS model has very little contingency
– Many aspects have been optimized (#copies, dynamic data placement, lifetime model)
• In practice, a reduction in our resources requires producing fewer MC events
– ATLAS grid dominated by MC simulation
• If resources are lacking: collaborative effort across S&C, Trigger, DataPrep, PC, sub-detectors
If +20% resource not available (2)
• 2 'options'
– Reducing the HLT output rate to 750 Hz -> stronger impact on physics program
– Parking data until LS2 -> negative impact on ATLAS students, and requires unexpected extra resources during LS2
• If we do not get more resources, maintain the 1 kHz HLT rate & process all data, we can produce 4.5B FullSim events in 2017
– For comparison, the need is 5.9B for 2017
– In 2016 we will produce approximately 5.2B
ATLAS policy for 2017
• In case +20% not achievable (realistic)
• T1
– Favor disk w.r.t. CPU
• Allows better benefit from opportunistic & pledged CPU resources
– Situation ~OK for tapes
• T2s
– For nucleus-like sites: favor disk
– For satellite-like sites: favor CPU
Balancing CPU/Disk is obsolete
Activities since last CAF (1)
Average 1.1M jobs/day (was 1.05M)
Activities since last CAF (2)
Running slots
• Constant > 220k running slots, up to 300k
– MC simulation decrease (end of campaign) & dip on 25/10 (CentOS vulnerability)
• Dominated by MC simulation (less MC reco)
Activities since last CAF (3)
FR cloud: 11.1% (last period 10.0%)
Walltime per processing cloud
MCORE (Production only)
• No more quota required
– e.g. the 80% quota is obsolete
– Just 'dynamic' handling, performing well
• FR-cloud: 13.4% in WT (last period 10.0%)
• (CERN-T0: -3%)
FR-Cloud
WT, all clouds
• CC (-9%), but last period was higher by 6% w.r.t. normal
• Tokyo, GRIF sites, CPPM, LAPP
Transfers: FR as source / FR as destination
• CPPM, CC
• RO-02, RO-07
FR-sites availability (ATLAS_CRITICAL)
• All sites >90%, except RO-16 & RO-14
Sites ranking ASAP (ICB): analysis availability online (since last CAF)
• Integrated over 2 months
• All above 90%! Instabilities: LAL
Issues (1)
• CPPM
– GGUS:124652 Failing transfers
• LAPP
– UDT cooling failure
• LPC
– GGUS:123227 Deletion errors
– Renater network problem
• LPNHE
– GGUS:124043 Deletion errors. Storage server problem
• LAL
– GGUS:124726 Deletion errors
Issues (2)
• IRFU
– Cooling issue: disk OK but CPU at 30%. OK now
– GGUS:124532 Failing transfers (reverse DNS lookup broken for IPv6)
• RO-14
– DATADISK full
• RO-16
– GGUS:124175 Squid down
• RO-07
– GGUS:124939 Failing transfers (disk full)
Ongoing: HPC progress
• HPC used today
– US, China, Europe (Germany)
– Still under development
– Used in backfill mode
• In France
– Working group ATLAS/CCIN2P3/IDRIS (Orsay)
• Work on a demonstrator
• PowerPC architecture not favorable
– Contact with TGCC (Saclay)
• Architecture OK but machine full
• Ongoing
Ongoing
• General
– Pledge revision for 2017 & 2018
– AFS_GROUPDIR removal OK
– Archiving old files to tape (sps, LOCAL)
– IPv6?
– HPC
• Sites
– Critical kernel vulnerability OK
– RO-LCG Federation review
• Document supplied to reviewers (A. Filipcic, LP)
• Report provided by reviewers