
20-21 October 2015 UK-T0 Workshop 1

Experience of Data Transfer to the Tier-1 from a DIRAC

Perspective

Lydia Heck

Institute for Computational Cosmology

Manager of the DiRAC-2 Data Centric Facility COSMA

20-21 October 2015 UK-T0 Workshop 2

Talk layout

● Introduction to DiRAC

● The DiRAC computing systems

● What is DiRAC

● What type of science is done on the DiRAC facility?

● Why do we need to copy data to RAL?

● Copying data to RAL – network requirements

● Collaboration between DiRAC and RAL to produce the archive

● Setting up the archiving tools

● Archiving

● Open issues

● Conclusions

20-21 October 2015 UK-T0 Workshop 3

Introduction to DiRAC

● DiRAC – Distributed Research utilising Advanced Computing, established in 2009 with DiRAC-1

● Support of research in theoretical astronomy, particle physics and nuclear physics

● Funded by STFC with infrastructure money allocated from the Department for Business, Innovation and Skills (BIS)

● The running costs, such as staff costs and electricity, are funded by STFC

Introduction to DiRAC, cont’d

● 2009 – DiRAC-1

– 8 installations across the UK, of which COSMA-4 at the ICC in Durham is one. Still a loose federation.

● 2011/2012 – DiRAC-2

– major funding of £15M for e-Infrastructure

– in the bidding to host, 5 installations were identified, judged by peers

– successful bidders faced scrutiny and interviews by BIS representatives to see whether we could deliver by a tight deadline

20-21 October 2015 UK-T0 Workshop 4

Introduction to DiRAC, cont’d

● DiRAC has a full management structure.

● Computing time on the DiRAC facility is allocated through a peer-reviewed procedure.

● Current director: Dr Jeremy Yates, UCL

● Current technical director: Prof Peter Boyle, Edinburgh

20-21 October 2015 UK-T0 Workshop 5

The DiRAC computing systems

20-21 October 2015 6 UK-T0 Workshop

● Blue Gene – Edinburgh

● Cosmos – Cambridge

● Complexity – Leicester

● Data Centric – Durham

● Data Analytic – Cambridge

The Blue Gene @ DiRAC

● Edinburgh – IBM Blue Gene

– 98304 cores

– 1 Pbyte of GPFS storage

– designed around (Lattice)QCD applications

20-21 October 2015 7 UK-T0 Workshop

COSMA @ DiRAC (Data Centric)

● Durham – Data Centric system – IBM iDataPlex

– 6720 Intel Sandy Bridge cores

– 53.8 TB of RAM

– FDR10 InfiniBand, 2:1 blocking

– 2.5 Pbyte of GPFS storage (2.2 Pbyte used!)

20-21 October 2015 8 UK-T0 Workshop

Complexity @ DiRAC

Leicester Complexity – HP system

• 4352 Intel Sandy Bridge cores

• 30 Tbyte of RAM

• FDR InfiniBand, 1:1 non-blocking

• 0.8 Pbyte of Panasas storage

20-21 October 2015 9 UK-T0 Workshop

Cosmos @ DiRAC (SMP)

● Cambridge COSMOS

● SGI shared memory system

– 1856 Intel Sandy Bridge cores

– 31 Intel Xeon Phi co-processors

– 14.8 Tbyte of RAM

– 146 Tbyte of storage

20-21 October 2015 10 UK-T0 Workshop

HPCS @ DiRAC (Data Analytic)

Cambridge Data Analytic – Dell

• 4800 Intel Sandy Bridge cores

• 19.2 TByte of RAM

• FDR InfiniBand, 1:1 non-blocking

• 0.75 PB of Lustre storage

20-21 October 2015 11 UK-T0 Workshop

What is DiRAC

● A national service run, managed and allocated by the scientists who do the science, funded by BIS and STFC

● The systems are built around and for the applications with which the science is done.

● We do not rival a facility like ARCHER, as we do not aspire to run a general national service.

● DiRAC is classed as a major research facility by STFC on a par with the big telescopes

20-21 October 2015 12 UK-T0 Workshop

What is DiRAC, cont’d

● Long projects with a significant number of CPU hours, typically allocated for 3 years on a specific system – examples for 2012 – 2015:

– Cosmos - dp002 : ~20M cpu hours on Cambridge Cosmos

– Virgo-dp004 : 63M cpu hours on Durham DC

– UK-MHD-dp010 : 40.5M cpu hours on Durham DC

– UK-QCD-dp008 : ~700M cpu hours on Edinburgh BG

– Exeter – dp005: ~15M cpu hours on Leicester Complexity

– HPQCD – dp019 : ~20M cpu hours on Cambridge Data Analytic

20-21 October 2015 UK-T0 Workshop 13

What type of science is done on DiRAC?

● For the highlights of science carried out on the DiRAC facility please see: http://www.dirac.ac.uk/science.html

● Specific example: Large scale structure calculations with the Eagle run

– 4096 cores

– ~8 GB RAM/core

– 47 days = 4,620,288 cpu hours (arithmetic check below)

– 200 TB of data
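
A quick check of the cpu-hour figure, using only the numbers above:

    4096 cores × 47 days × 24 hours/day = 4,620,288 cpu hours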

20-21 October 2015 14 UK-T0 Workshop

Why do we need to copy data (to RAL)?

● Original plan – each research project should make provision for storing its research data

– requires additional storage resources at researchers’ home institutions

– not enough provision – will require additional funds.

– data creation considerably above expectation?

– if disaster struck, many cpu hours of calculations would be lost.

20-21 October 2015 15 UK-T0 Workshop

Why do we need to copy data (to RAL)?

● Research data must now be shared with/available to interested parties

● Installing DiRAC’s own archive requires funds, and currently there is no budget.

● we needed to get started:

– Jeremy Yates negotiated access to the RAL archive system

● Acquire expertise

● Identify bottlenecks and technical challenges

– submitted 2,000,000 files and created an issue at the file servers

● How can we collaborate and make use of previous experience?

● AND: copy data!

20-21 October 2015 16 UK-T0 Workshop

Copying data to RAL – network requirements

● network bandwidth – situation for Durham

– now:

● currently possible: 300-400 Mbytes/sec

● required investment and collaboration from DU CIS

● upgrade to 6 Gbit/sec to JANET – Sep 2014

● will be 10 Gbit/sec by the end of 2015 – infrastructure already installed

– past:

identified Durham-related bottlenecks – the FIREWALL

20-21 October 2015 17 UK-T0 Workshop

Copying data to RAL – network requirements

● network bandwidth – situation for Durham

investment to bypass the external campus firewall:

two new routers (~£80k) – configured for throughput, with minimal ACLs, enough to safeguard the site.

deploying internal firewalls – part of the new security infrastructure, essential for such a venture

Security now relies on the front-end systems of Durham DiRAC and Durham GridPP.

20-21 October 2015 18 UK-T0 Workshop

Copying data to RAL – network requirements

Result for COSMA and GridPP in Durham

guaranteed 2-3 Gbit/sec, with bursts of up to 3-4 Gbit/sec

(3 Gbit/sec outside of term time)

pushed the network performance for Durham GridPP from bottom 3 in the country to top 5 of the UK GridPP sites

achieves up to 300-400 Mbyte/sec throughput to RAL when archiving, depending on file sizes (rough check below).
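
A rough sanity check on these figures:

    3 Gbit/sec ÷ 8 bits/byte ≈ 375 Mbyte/sec

so the observed 300-400 Mbyte/sec archiving throughput is broadly consistent with the guaranteed 2-3 Gbit/sec share of the link, with the upper end reaching into the burst range.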

20-21 October 2015 19 UK-T0 Workshop

Collaboration between DiRAC and GridPP/RAL

● Durham Institute for Computational Cosmology (ICC) volunteered to be the prototype installation

● Huge thanks to Jens Jensen and Brian Davies - there were many emails exchanged, many questions asked and many answers given.

● Resulting document

“Setting up a system for data archiving using FTS3” by Lydia Heck, Jens Jensen and Brian Davies

20-21 October 2015 20 UK-T0 Workshop

Setting up the archiving tools

● Identify appropriate hardware – could mean extra expense:

need the freedom to modify and experiment – cannot have HPC users logged in and working!

free to apply the very latest security updates

requires an optimal connection to the storage – InfiniBand card

20-21 October 2015 21 UK-T0 Workshop

Setting up the archiving tools

● Create an interface to access the file/archiving service at RAL using the GridPP tools (install sketch below)

– GridFTP – Globus Toolkit – also provides Globus Connect

– Trust anchors (egi-trustanchors)

– voms tools (emi3-xxx)

– fts3 (cern)
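
A minimal install sketch for this tool stack, assuming a Scientific Linux / CentOS host with the EGI trust-anchor, EMI-3 and CERN FTS3 yum repositories already configured; the package names are indicative of that era and should be checked against the repositories actually in use:

    # CA certificates / trust anchors (egi-trustanchors repository)
    yum install ca-policy-egi-core

    # VOMS client tools (EMI-3 repository)
    yum install voms-clients

    # GridFTP client from the Globus Toolkit (provides globus-url-copy)
    yum install globus-gass-copy-progs

    # FTS3 command-line clients (CERN FTS3 repository; provides
    # fts-transfer-submit, fts-transfer-status, fts-transfer-delegation)
    yum install fts-client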

20-21 October 2015 UK-T0 Workshop 22

Archiving?

● Long-lived VOMS proxy?

– myproxy-init; myproxy-logon; voms-proxy-init; fts-transfer-delegation

● How to create a proxy and delegation that lasts weeks, even months? – still an issue

● grid-proxy-init; fts-transfer-delegation (see the sketch below)

– grid-proxy-init -valid HH:MM

– fts-transfer-delegation -e time-in-seconds

– creates a proxy that lasts up to the certificate lifetime.
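
A minimal sketch of this second approach, assuming a personal grid certificate is already installed under ~/.globus; the FTS endpoint URL is a placeholder, and the -s (service endpoint) option is assumed from standard FTS3 client usage:

    # local proxy valid for 7 days (capped by the certificate lifetime)
    grid-proxy-init -valid 168:00

    # delegate a credential to the FTS3 service for 7 days (604800 seconds)
    FTS_ENDPOINT=https://fts3.example.ac.uk:8446   # placeholder endpoint
    fts-transfer-delegation -s $FTS_ENDPOINT -e 604800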

20-21 October 2015 UK-T0 Workshop 23

Archiving

● Large files – optimal throughput limited by network bandwidth

● Many small files – limited by latency; use the ‘-r’ flag to fts-transfer-submit to re-use the connection (see the sketch below)

● Transferred:

– ~40 Tbytes since 20 August

– ~2M files

– challenge to FTS service at RAL

● User education on creating lots of small files
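
A minimal submission sketch, assuming the delegation set up earlier; hostnames, paths and the bulk-file option are illustrative, only the ‘-r’ re-use flag comes from the slide:

    # one large file: throughput is limited by the network
    fts-transfer-submit -s $FTS_ENDPOINT \
        gsiftp://cosma.example.ac.uk/data/bigfile.dat \
        gsiftp://ral.example.ac.uk/archive/durham/bigfile.dat

    # many small files: submit them as one job and re-use the connection (-r),
    # reading "source destination" pairs from a bulk list file (-f)
    fts-transfer-submit -s $FTS_ENDPOINT -r -f smallfiles.list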

20-21 October 2015 UK-T0 Workshop 24

Open issues

● ownership and permissions are not preserved

● depends on a single admin to carry it out.

● what happens when the content of directories changes? – complete new archive sessions?

● a repeat run tries to archive all the files again but then ‘fails’ as the files already exist – should behave more like rsync (see the sketch below)
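
A hypothetical rsync-like workaround for the last point, sketched in shell; the manifest file, hostnames and paths are invented for illustration:

    # manifest of paths already archived, maintained between sessions
    MANIFEST=archived-files.txt

    # list candidate files and keep only those not yet in the manifest
    find /cosma/data/project -type f | sort > all-files.txt
    comm -23 all-files.txt <(sort "$MANIFEST") > to-archive.txt

    # build "source destination" pairs for the next bulk submission
    awk '{print "gsiftp://cosma.example.ac.uk" $0, "gsiftp://ral.example.ac.uk/archive" $0}' to-archive.txt > bulk.list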

20-21 October 2015 UK-T0 Workshop 25

Conclusions

● With the right network speed we can archive the DiRAC data to RAL.

● The documentation has to be completed and shared with the system managers on the other DiRAC sites

● Each DiRAC site will have its own dirac0X account

● Start archiving – and keep on archiving

● Collaboration between DiRAC and GridPP/RAL DOES work!

● Can we aspire to more?

20-21 October 2015 UK-T0 Workshop 26