efi.uchicago.edu
ci.uchicago.edu
FAX status report
Ilija Vukotic, on behalf of the atlas-adc-federated-xrootd working group
S&C week, Jun 2, 2014
Content
• Status
  – Coverage
  – Traffic
  – Failover
  – Overflow
• Changes in localSetupFAX
• Monitoring changes
  – Changes in GLED collector, dashboard
  – Failover & overflow monitoring
  – FaxStatusBoard
• Meetings
  – Tutorial, 23-27 June: dedicated to instructing on xAOD and the new analysis model
  – ROOTIO, 25-27 June
FAX topology
Topology change in North America
• added East and West
• will serve CA cloud
• all hosted at BNL
Will need NL cloud redirector
FAX in Europe
To come:
• Sara
• Nikhef
• IL cloud: IL-TAU, Technion, Weizmann
FAX in North America
To come:
• TRIUMF (June?)
• McGill (end of June)
• SCINET (end of June)
• Victoria (~August)
FAX in Asia
To come:
• Beijing (~two weeks)
• Tokyo
• Australia (few weeks)
Status
• Most sites running stably
• Glitches do happen but are usually fixed within a few hours
• SSB issues solved
• New sites added
  – IFAE
  – PIC
  – IN2P3-LPC
• In need of restart:
  – UNIBE-LHEP
Coverage
• Now auto-updated Twiki page
  – https://twiki.cern.ch/twiki/bin/view/AtlasComputing/FaxCoverage
• Coverage is good (~85%), but we should aim for >95%!
• Info fetched from http://dashb-atlas-job-prototype.cern.ch/dashboard/request.py/dailysummary
Traffic
• Slowly increasing
• Peak output record broken
• Still small compared to what we expect will come
Failover
• Running stably
Overflow status
• The whole chain is ready
• I have set all the US queues to allow 3 Gbps both delivered to and delivered from sites.
• Test tasks submitted to sites that don't have the data, so that transfertype=FAX is invoked.
• This does not test the JEDI decision making (the one based on the cost matrix)
• Waiting for actual jobs to check the full chain
  – Users not yet instructed to use JEDI client
  – Waiting for JEDI monitor
Overflow tests
• Test is the hardest IO test: 100% of events, all branches read, standard TTC, no AsyncPrefetch.
• Site-specific FDR datasets (10 DSs, 744 files, 2.7 TB)
• All the source/destination combinations of US sites
• All of it submitted in 3 batches, but not all started simultaneously. Affected by priority degradation.
• Three input files per job.
• If the site is copy2scratch, the pilot does xrdcp to scratch; if not, jobs access files remotely.
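The choice between copying to scratch and direct remote access can be sketched as below. This is an illustration only, not the pilot's actual code: the helper name `plan_input_access`, the redirector host, the file path and the scratch directory are all hypothetical. The only real tool referenced is `xrdcp`, called in its basic `xrdcp <source> <destination>` form.

```python
def plan_input_access(copy2scratch, redirector, path, scratch_dir="/scratch"):
    """Describe how a job would reach one input file (hypothetical helper).

    `path` is assumed to be an absolute path starting with "/", so the
    resulting URL has the usual root://host//path form.
    """
    url = f"root://{redirector}/{path}"
    if copy2scratch:
        # Copy the file to local scratch first, then read the local copy.
        local = f"{scratch_dir}/{path.rsplit('/', 1)[-1]}"
        return {"mode": "copy", "command": ["xrdcp", url, local], "open": local}
    # Direct access: the job opens the remote URL itself, no copy step.
    return {"mode": "direct", "command": None, "open": url}


if __name__ == "__main__":
    # Host and path are placeholders, not real FAX endpoints.
    for c2s in (True, False):
        print(plan_input_access(c2s, "redirector.example.org", "/atlas/some/file.root"))
```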
Overflow tests
• Error rate
  – Total: 9188 jobs
  – Finished: 9052
  – Failed: 117 (1.3%)
    o 24: OU reading OU (no FAX involved)
    o 66: reading from WT2 (files are corrupted)
    o 27 (0.29%): actual FAX errors where SWT2 did not deliver the files. Will be investigated.
    o The rest are "Payload run out of memory"
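The quoted percentages can be reproduced from the job counts. The short check below uses only the numbers on this slide; the dictionary keys are paraphrased labels, and the helper name `pct` is arbitrary.

```python
total_jobs = 9188

# Failure categories and counts as quoted on the slide.
failed_by_cause = {
    "OU reading OU (no FAX involved)": 24,
    "reading from WT2 (corrupted files)": 66,
    "actual FAX errors (SWT2 did not deliver)": 27,
}


def pct(count, total=total_jobs):
    """Percentage of all submitted jobs."""
    return 100.0 * count / total


failed_total = sum(failed_by_cause.values())

if __name__ == "__main__":
    print(f"failed: {failed_total} jobs, {pct(failed_total):.1f}% of total")
    for cause, n in failed_by_cause.items():
        print(f"  {cause}: {n} jobs, {pct(n):.2f}%")
```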
Overflow tests
• Jobs reading from local scratch - for comparison
Direct access site, reading locally
Per job:
• 7.2 MB/s
• 67% CPU eff
• 71 ev/s
Copy2scratch site
Per job:
• 11.0 MB/s
• 97% CPU eff
• 109 ev/s
Scout jobs (label on both plots)
Overflow tests
• Jobs reading remote sources
Direct access site, reading remotely (no saturation)
Per job:
• 4.2 MB/s
• 43% CPU eff
• 42 ev/s
Direct access site, reading remotely (possibly a start of saturation)
Per job:
• 3.5 MB/s
• 29% CPU eff
• 34 ev/s
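As a quick consistency check on the local and remote per-job numbers, dividing throughput by event rate gives the data read per event. If the same test job runs everywhere, that ratio should be roughly constant, so event rate simply tracks delivered bandwidth. The sketch below uses only the figures quoted on the two comparison slides; the labels are shorthand introduced here.

```python
# (throughput in MB/s, event rate in ev/s), as quoted on the slides.
measurements = {
    "direct access, local": (7.2, 71),
    "copy2scratch, local": (11.0, 109),
    "direct access, remote (first sample)": (4.2, 42),
    "direct access, remote (second sample)": (3.5, 34),
}


def mb_per_event(throughput_mb_s, events_per_s):
    """Data volume read per processed event, in MB."""
    return throughput_mb_s / events_per_s


per_event = {name: mb_per_event(*m) for name, m in measurements.items()}

if __name__ == "__main__":
    for name, value in per_event.items():
        print(f"{name}: {value:.3f} MB/event")
```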
Overflow tests
• MWT2 reading from OU and SWT2 simultaneously
• In aggregate reached 850 MB/s, the limit for MWT2 at that time.
Cost matrix
[Figure: cost matrix; axes: source, destination]
http://1-dot-waniotest.appspot.com/
localSetupFAX
• Added command fax-ls
  – Made by Shuwei YE.
  – Will eventually replace isDSinFAX
  – He will move all the other tools to Rucio
• Change in fax-get-best-redirector
  – Each time it does three queries
    o SSB to get endpoints and their status
    o AGIS to get the sites hosting the endpoints
    o AGIS to get site coordinates
  – Each call returns hundreds of kB
  – Can't scale to a large number of requests
  – Solution:
    o Made Google App Engine servlets that every 30 min take info from SSB and AGIS and deliver it from memory
    o Information slimmed to what is actually needed: ~several kB
    o Now requests are served in a few tens of ms.
    o "Infinitely" scalable
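The idea behind the solution (fetch the heavy upstream data rarely, slim it, serve it from memory) can be sketched as below. This is not the actual servlet code: the class `SlimCache`, the function `slim_endpoints` and the record fields `endpoint` and `status` are hypothetical. The sketch also refreshes lazily on the first request after 30 minutes, whereas the real service pulls from SSB and AGIS on a schedule.

```python
import time


def slim_endpoints(records):
    """Keep only the fields a client needs (field names are assumptions)."""
    return [{"endpoint": r["endpoint"], "status": r["status"]} for r in records]


class SlimCache:
    """Serve slimmed upstream data from memory, refetching when it is stale."""

    def __init__(self, fetch, slim, max_age_s=30 * 60, clock=time.monotonic):
        self._fetch = fetch          # callable returning the full upstream payload
        self._slim = slim            # callable reducing it to what is needed
        self._max_age_s = max_age_s  # 30 minutes, as quoted on the slide
        self._clock = clock
        self._data = None
        self._stamp = 0.0

    def get(self):
        now = self._clock()
        if self._data is None or now - self._stamp >= self._max_age_s:
            self._data = self._slim(self._fetch())
            self._stamp = now
        return self._data


if __name__ == "__main__":
    # Stand-in for a real SSB/AGIS query; the payload is made up.
    def fake_fetch():
        return [{"endpoint": "site-a", "status": "ok", "details": "x" * 1000}]

    cache = SlimCache(fake_fetch, slim_endpoints)
    print(cache.get())
```

Most requests then cost only a memory lookup, which is where the response time of a few tens of ms comes from.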
Monitoring – collector, dashboard
• Problem: support of multi-VO sites
• Meeting: Alex, Matevz, me
• Issues:
  – Site name:
    o ATLAS reports it
    o CMS does not, or reports it badly; will fix it
  – Requesting user's VO
    o ATLAS does it
    o CMS is not strict about it. US-CMS uses GUMS. Will fix it.
• Proposal:
  – During the summer Matevz develops XrdMon that can handle multi-VO messages
  – Sends messages from multi-VO sites to a special "mixed" AMQ. Dashboard splits traffic according to user's VO.
Details: https://docs.google.com/document/d/1Syx3_vkwCfc5lj2lQzbUUrKT0Je238w6lcwVL7IY1GY/edit#
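The proposed dashboard-side split of the "mixed" stream could look roughly like the sketch below. The message layout is an assumption: the field name `user_vo` and the fallback bucket `unknown` are invented for illustration and do not come from the XrdMon message format.

```python
from collections import defaultdict


def split_by_vo(messages, default_vo="unknown"):
    """Group monitoring messages by the requesting user's VO.

    Messages without a VO go to `default_vo`, so they are not silently lost.
    """
    per_vo = defaultdict(list)
    for msg in messages:
        vo = msg.get("user_vo") or default_vo
        per_vo[vo].append(msg)
    return dict(per_vo)


if __name__ == "__main__":
    # Made-up messages from a hypothetical multi-VO site.
    mixed = [
        {"site": "SITE_X", "user_vo": "atlas", "bytes_read": 100},
        {"site": "SITE_X", "user_vo": "cms", "bytes_read": 250},
        {"site": "SITE_X", "bytes_read": 50},
    ]
    for vo, msgs in split_by_vo(mixed).items():
        print(vo, sum(m["bytes_read"] for m in msgs))
```

Keeping an explicit bucket for messages with no VO makes the second issue above (VO not always reported) visible in the monitoring itself.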
Monitoring
• Failover
  – Not flexible enough
• Overflow
  – No monitoring yet
  – Need to compare jobs grouped by transfer type
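A comparison of jobs grouped by transfer type might be computed as in the sketch below. It assumes job records are available as dictionaries; the field names `failed` and `cpu_eff`, and the transfer type value `local`, are hypothetical. Only `transfertype` with the value `FAX` appears in this report.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_transfer_type(jobs):
    """Per transfer type: number of jobs, failure rate, mean CPU efficiency."""
    groups = defaultdict(list)
    for job in jobs:
        groups[job["transfertype"]].append(job)
    return {
        ttype: {
            "jobs": len(js),
            "failure_rate": sum(1 for j in js if j["failed"]) / len(js),
            "mean_cpu_eff": mean(j["cpu_eff"] for j in js),
        }
        for ttype, js in groups.items()
    }


if __name__ == "__main__":
    # Made-up job records, for illustration only.
    jobs = [
        {"transfertype": "FAX", "failed": False, "cpu_eff": 0.43},
        {"transfertype": "FAX", "failed": True, "cpu_eff": 0.29},
        {"transfertype": "local", "failed": False, "cpu_eff": 0.67},
    ]
    for ttype, stats in summarize_by_transfer_type(jobs).items():
        print(ttype, stats)
```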