20-21 October 2015 UK-T0 Workshop 1
Experience of Data Transfer to the Tier-1 from a DIRAC
Perspective
Lydia Heck
Institute for Computational Cosmology
Manager of the DiRAC-2 Data Centric Facility COSMA
Talk layout
● Introduction to DiRAC
● The DiRAC computing systems
● What is DiRAC
● What type of science is done on the DiRAC facility?
● Why do we need to copy data to RAL?
● Copying data to RAL – network requirements
● Collaboration between DiRAC and RAL to produce the archive
● Setting up the archiving tools
● Archiving
● Open issues
● Conclusions
Introduction to DiRAC
● DiRAC – Distributed Research utilising Advanced Computing – established in 2009 with DiRAC-1
● Support of research in theoretical astronomy, particle physics and nuclear physics
● Funded by STFC with infrastructure money allocated from the Department for Business, Innovation and Skills (BIS)
● The running costs, such as staff and electricity, are funded by STFC
Introduction to DiRAC, cont’d
● 2009 – DiRAC-1
– 8 installations across the UK, of which COSMA-4 at the ICC in Durham is one – still a loose federation
● 2011/2012 – DiRAC-2
– major funding of £15M for e-Infrastructure
– bids to host were judged by peers – 5 installations identified
– successful bidders faced scrutiny and interview by representatives for BIS to see if we could deliver by a tight deadline
Introduction to DiRAC, cont’d
● DiRAC has a full management structure.
● Computing time on the DiRAC facility is allocated through a peer-reviewed procedure.
● Current director: Dr Jeremy Yates, UCL
● Current technical director: Prof Peter Boyle, Edinburgh
The DiRAC computing systems
● Blue Gene – Edinburgh
● Cosmos – Cambridge
● Complexity – Leicester
● Data Centric – Durham
● Data Analytic – Cambridge
The Blue Gene @ DiRAC
● Edinburgh – IBM Blue Gene
– 98304 cores
– 1 Pbyte of GPFS storage
– designed around (Lattice)QCD applications
COSMA @ DiRAC (Data Centric)
● Durham – Data Centric system – IBM iDataPlex
– 6720 Intel Sandy Bridge cores
– 53.8 TB of RAM
– FDR10 infiniband 2:1 blocking
– 2.5 Pbyte of GPFS storage (2.2 Pbyte used!)
Complexity @ DiRAC
Leicester Complexity – HP system
• 4352 Intel Sandy Bridge cores
• 30 Tbyte of RAM
• FDR 1:1 non-blocking
• 0.8 Pbyte of Panasas storage
Cosmos @ DiRAC (SMP)
● Cambridge COSMOS
● SGI shared memory system
– 1856 Intel Sandy Bridge cores
– 31 Intel Xeon Phi co-processors
– 14.8 Tbyte of RAM
– 146 Tbyte of storage
HPCS @ DiRAC (Data Analytic)
Cambridge Data Analytic – Dell
• 4800 Intel Sandy Bridge cores
• 19.2 TByte of RAM
• FDR Infiniband 1:1 non-blocking
• 0.75 PB of Lustre storage
What is DiRAC
● A national service run, managed, and allocated by the scientists who do the science, funded by BIS and STFC
● The systems are built around and for the applications with which the science is done.
● We do not rival a facility like ARCHER, as we do not aspire to run a general national service.
● DiRAC is classed as a major research facility by STFC on a par with the big telescopes
What is DiRAC, cont’d
● Long projects with significant allocations of CPU hours, typically for 3 years on a specific system – examples for 2012-2015:
– Cosmos - dp002 : ~20M cpu hours on Cambridge Cosmos
– Virgo-dp004 : 63M cpu hours on Durham DC
– UK-MHD-dp010 : 40.5M cpu hours on Durham DC
– UK-QCD-dp008 : ~700M cpu hours on Edinburgh BG
– Exeter – dp005: ~15M cpu hours on Leicester Complexity
– HPQCD – dp019 : ~20M cpu hours on Cambridge Data Analytic
What type of science is done on DiRAC?
● For the highlights of science carried out on the DiRAC facility please see: http://www.dirac.ac.uk/science.html
● Specific example: large-scale structure calculations with the Eagle run
– 4096 cores
– ~8 GB RAM/core
– 47 days = 4,620,288 cpu hours
– 200 TB of data
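The quoted CPU-hour figure follows directly from the core count and runtime, which makes for a quick sanity check:

```shell
# Cross-check of the Eagle numbers above: 4096 cores for 47 days of wall time.
CPU_HOURS=$((4096 * 47 * 24))
echo "$CPU_HOURS"    # 4620288 – matching the quoted 4,620,288 cpu hours
```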
Why do we need to copy data (to RAL)?
● Original plan - each research project should make provisions for storing the research data
– requires additional storage resource at researchers’ home institutions
– Not enough provision – will require additional funds.
– data creation considerably above expectation
– if disaster struck, many CPU hours of calculations would be lost
Why do we need to copy data (to RAL)?
● Research data must now be shared with/available to interested parties
● Installing DiRAC’s own archive requires funds – currently there is no budget.
● We needed to get started:
– Jeremy Yates negotiated access to the RAL archive system
● Acquire expertise
● Identify bottlenecks and technical challenges
– submitted 2,000,000 files, which created an issue at the file servers
● How can we collaborate and make use of previous experience?
● AND: copy data!
Copying data to RAL – network requirements
● network bandwidth – situation for Durham
– now:
● currently possible: 300-400 Mbyte/sec
● required investment and collaboration from DU CIS
● upgrade to 6 Gbit/sec to JANET – Sep 2014
● will be 10 Gbit/sec by end of 2015 – infrastructure already installed
– past:
● identified Durham-related bottlenecks – FIREWALL
Copying data to RAL – network requirements
● network bandwidth – situation for Durham
investment to bypass the external campus firewall:
two new routers (~£80k) – configured for throughput with a minimal ACL, enough to safeguard the site
deploying internal firewalls – part of new security infrastructure, essential for such a venture
Security now relies on front-end system of Durham DiRAC and Durham GridPP.
Copying data to RAL – network requirements
Result for COSMA and GridPP in Durham
guaranteed 2-3 Gbit/sec with bursts of up to 3-4 Gbit/sec
(3 Gbit/sec outside of term time)
pushed the network performance for Durham GridPP from bottom 3 in the country to top 5 of the UK GridPP sites
achieves up to 300-400 Mbyte/sec throughput to RAL when archiving, depending on file sizes.
Collaboration between DiRAC and GridPP/RAL
● Durham Institute for Computational Cosmology (ICC) volunteered to be the prototype installation
● Huge thanks to Jens Jensen and Brian Davies - there were many emails exchanged, many questions asked and many answers given.
● Resulting document
“Setting up a system for data archiving using FTS3” by Lydia Heck, Jens Jensen and Brian Davies
Setting up the archiving tools
● Identify appropriate hardware – could mean extra expense:
need freedom to modify and experiment – cannot have HPC users logged in and working!
free to apply the very latest security updates
requires optimal connection to storage – InfiniBand card
Setting up the archiving tools
● Create an interface to access the file/archiving service at RAL using the GridPP tools
– gridftp – Globus Toolkit – also provides Globus Connect
– Trust anchors (egi-trustanchors)
– voms tools (emi3-xxx)
– fts3 (cern)
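With those components installed, a first sanity check is simply whether the client tools are on the path. A minimal sketch – the three command names are the standard Globus/VOMS/FTS3 clients, so adjust the list to whatever your installation actually provides:

```shell
# Report which of the archiving client tools are installed on this host.
REPORT=$(for tool in globus-url-copy voms-proxy-init fts-transfer-submit; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: ok"
    else
        echo "$tool: MISSING"
    fi
done)
echo "$REPORT"
```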
Archiving?
● long-lived voms proxy?
– myproxy-init; myproxy-logon; voms-proxy-init; fts-transfer-delegation
● How to create a proxy and delegation that lasts weeks, even months? – still an issue
● grid-proxy-init; fts-transfer-delegation
– grid-proxy-init -valid HH:MM
– fts-transfer-delegation -e time-in-seconds
– creates a proxy that lasts up to the certificate lifetime.
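The grid-proxy route can be scripted for unattended archiving. A sketch, assuming the stock Globus and FTS3 clients named above; the FTS3 endpoint URL is a placeholder, not the real RAL service:

```shell
# Lifetimes: grid-proxy-init takes HH:MM, fts-transfer-delegation takes seconds.
DAYS=14
HOURS=$((DAYS * 24))                  # 336 hours
DELEG_SECONDS=$((DAYS * 24 * 3600))   # 1209600 seconds
echo "$HOURS hours = $DELEG_SECONDS seconds"
# grid-proxy-init -valid "${HOURS}:00"
# fts-transfer-delegation -s https://fts3.example.ac.uk:8446 -e "$DELEG_SECONDS"
# Note: the delegated proxy still cannot outlive the certificate itself.
```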
Archiving
● Large files – optimal throughput limited by network bandwidth
● Many small files – limited by latency; using the ‘-r’ flag to fts-transfer-submit to re-use the connection
● Transferred:
– ~40 Tbytes since 20 August
– ~2M files
– challenge to FTS service at RAL
● User education on creating lots of small files
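For the many-small-files case, one way to keep latency down is to prepare source/destination URL pairs in bulk and submit them with connection reuse. A sketch only: the gridftp endpoints below are placeholders, and the exact bulk-submission syntax depends on the FTS3 client version:

```shell
# Build source/destination URL pairs for a batch of files.
SRC="gsiftp://source.example.ac.uk/cosma5/data"   # placeholder source door
DST="gsiftp://gridftp.example.ac.uk/archive"      # placeholder RAL endpoint
for f in part_000.dat part_001.dat; do
    echo "$SRC/$f $DST/$f"
done > pairs.txt
wc -l < pairs.txt
# Submit with the '-r' flag so files within one job re-use the gridftp
# connection (bulk-submission syntax varies with the client version):
# fts-transfer-submit -s https://fts3.example.ac.uk:8446 -r <source> <destination>
```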
Open issues
● ownership and permissions are not preserved
● depends on a single admin to carry out
● what happens when content in directories changes? – complete new archive sessions?
● re-archiving tries to copy all the files again but then ‘fails’ as the file already exists – should behave more like rsync
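The rsync-like behaviour wished for above can be approximated by the archiving scripts themselves, if they keep a manifest of what has already been transferred and only submit the difference. A sketch, assuming such a manifest exists (FTS3 does not maintain one for you):

```shell
# Files already at RAL vs. files currently on disk (both sorted for comm).
printf 'a.dat\nb.dat\n'        | sort > archived.sorted
printf 'a.dat\nb.dat\nc.dat\n' | sort > local.sorted
# comm -13 prints lines present only in the second file: the backlog.
comm -13 archived.sorted local.sorted > todo.txt
cat todo.txt    # c.dat – the only file still to be submitted
```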
Conclusions
● With the right network speed we can archive the DiRAC data to RAL.
● The documentation has to be completed and shared with the system managers on the other DiRAC sites
● Each DiRAC site will have its own dirac0X account
● Start with and keep on archiving
● Collaboration between DiRAC and GridPP/RAL DOES work!
● Can we aspire to more?