Testing the UK Tier 2 Data Storage and Transfer Infrastructure

C. Brew (RAL), Y. Coppens (Birmingham), G. Cowen (Edinburgh) & J. Ferguson (Glasgow)

9-13 October 2006


Outline

• What are we testing and why?
• What is the setup?
• Hardware and Software Infrastructure
• Test Procedures
• Lessons and Successes
• RAL Castor
• Conclusions and Future


What and Why

• What:
– Set up the systems and people to test the rates at which the UK Tier 2 sites can import and export data
• Why:
– Once the LHC experiments are up and running, Tier 2 sites will need to absorb data from, and upload data to, the Tier 1s at quite alarming rates:
• ~1 Gb/s for a medium-sized Tier 2
– The UK has a number of "experts" in tuning DPM/dCache; these tests should spread some of that knowledge
– Get local admins at the sites to learn a bit more about their upstream networks


Why T2 → T2

• CERN is driving the Tier 0 → Tier 1 and Tier 1 → Tier 1 transfer tests, but the Tier 2s need to get ready too.
• No experiment has a use case that calls for transfers between Tier 2 sites, so why run T2 → T2?
– It tests the network/storage infrastructure at each Tier 2 site
– There are too many sites to test each against the T1
– The T1 is busy with T0 → T1 and T1 → T1 tests
– T1 → T2 tests were run at the end of last year


Physical Infrastructure

• Each UK Tier 2 site has an SRM Storage Element, either dCache or DPM
• Generally the network path is:
– Departmental Network
– Site/University Network
– Metropolitan Area Network (MAN)
– JANET (the UK's educational/research backbone)
• Connection speeds vary from a share of 100 Mb/s to 10 Gb/s, generally 1 or 2 Gb/s


Network Infrastructure

[Diagram: network path from Departmental network → Site/University network → MAN → UK backbone (JANET)]



Software Used

• This is a test of the Grid software stack as well as of the T2 hardware, so we try to use that stack throughout:
– The data sink/source is the SRM-compliant SE
– Transfers are done using the File Transfer Service (FTS)
– The filetransfer script is used to submit and monitor the FTS transfers (a hand-driven sketch follows below): http://www.physics.gla.ac.uk/~graeme/scripts/filetransfer
– Transfers are generally done over the production network with the production software, without special short-term tweaks
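Purely as an illustration (this is not the filetransfer script itself), a minimal sketch of driving a single SRM-to-SRM copy through FTS with the gLite command-line clients of the era; the FTS endpoint and SURLs here are hypothetical placeholders:

    # Sketch only: submit one SRM-to-SRM copy via FTS and poll it to completion.
    # Assumes the era's gLite clients (glite-transfer-submit/-status) are on PATH;
    # the endpoint and SURLs are made up for illustration.
    import subprocess
    import time

    FTS = "https://fts.example.ac.uk:8443/glite-data-transfer-fts/services/FileTransfer"
    SRC = "srm://se1.example.ac.uk/dpm/example.ac.uk/home/dteam/canned/file0001"
    DST = "srm://se2.example.ac.uk/dpm/example.ac.uk/home/dteam/test/file0001"

    # glite-transfer-submit prints the FTS job identifier on stdout
    job_id = subprocess.check_output(
        ["glite-transfer-submit", "-s", FTS, SRC, DST], text=True).strip()

    # Poll until the job reaches a terminal state (state names vary by FTS version)
    state = "Submitted"
    while state not in ("Done", "Failed", "Canceled"):
        time.sleep(30)
        state = subprocess.check_output(
            ["glite-transfer-status", "-s", FTS, job_id], text=True).strip()

    print(job_id, state)

The real tests used the filetransfer script linked above to drive many such transfers in bulk rather than one at a time.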


File Transfer Service

• A fairly recent addition to the LCG middleware:
– Manages the transfer of files from one SRM server to another; manages bandwidth and queues, and retries failed transfers
– Defines "channels" along which files are transferred between sites
– Generally each T2 has three channels defined:
• T1 → Site
• T1 ← Site
• Elsewhere → Site
– Each channel sets connection parameters, limits on the number of parallel transfers, VO shares, etc. (sketched below)
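To make the per-channel tuning knobs concrete, a sketch of what a channel definition carries; this is not the FTS admin interface, and the names and values are illustrative:

    # Illustrative model of an FTS channel definition; not an FTS API.
    from dataclasses import dataclass

    @dataclass
    class Channel:
        name: str              # conventionally SOURCE-DEST, e.g. "RAL-SITE"
        source: str            # source site, or "*" for a catch-all channel
        dest: str              # destination site
        concurrent_files: int  # files in flight at once on this channel
        streams_per_file: int  # parallel TCP streams per file
        vo_shares: dict        # share of the channel each VO may use

    # The three channels a typical T2 would have, per the bullet above
    channels = [
        Channel("RAL-SITE", "RAL", "SITE", 10, 5, {"dteam": 100}),
        Channel("SITE-RAL", "SITE", "RAL", 10, 5, {"dteam": 100}),
        Channel("STAR-SITE", "*", "SITE", 10, 5, {"dteam": 100}),
    ]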


Setup

• GridPP had done some T1 ↔ T2 transfer tests last year

• Three sites that had already demonstrated >300 Mb/s transfer rates in the previous tests were chosen as reference sites

• Each site to be tested nominated a named individual to “own” the tests for their site


Procedure

• Three weeks before the official start of the tests, the reference sites started testing against each other:
– Confirmed that they could still achieve the necessary rates
– Tested the software to be used in the tests
• Each T2 site was assigned a reference site to act as its surrogate T1, and a time slot in which to perform 24-hour read and write tests
• The basic site test was:
– Beforehand, copy 100 "canned" 1 GB files to the source SRM
– Repeatedly transfer these files to the sink for 24 hours
– Reverse the flow and copy data from the reference site for 24 hours
– The rate is simply (number of files successfully transferred × file size) / time (worked through below)
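A worked example of that bookkeeping; the file count is hypothetical, while the 1 GB file size and 24-hour window come from the recipe above:

    # Rate = (number of files successfully transferred * file size) / time
    files_transferred = 3500        # hypothetical count of successful copies
    file_size_bytes = 1e9           # the 1 GB "canned" files, reused repeatedly
    window_seconds = 24 * 3600      # one 24-hour test window

    rate_mbps = files_transferred * file_size_bytes * 8 / window_seconds / 1e6
    print("%.0f Mb/s" % rate_mbps)  # -> 324 Mb/s, close to the best rate seen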


Issues / Lessons

• Loss of a reference site before we even started:
– Despite achieving very good rates in the previous tests, no substantive change at the site, and heroic efforts, it could not sustain >250 Mb/s
• Tight timescale:
– Tests using each reference site were scheduled for each working day, so if a site missed its slot or had a problem during the test there was no room to catch up


Issues / Lessons

• Lack of pre-test tests:
– Sites only had a 48-hour slot for two 24-hour tests, and reference sites were normally busy with other tests, so there was little opportunity for sites to tune their storage/channel before the main tests
• Network variability:
– Especially prevalent during the reference site tests
– Performance could vary hour by hour by as much as 50%, for no reason apparent on the LANs at either end
– In the long term, changes upstream (a new firewall, or rate limiting by your MAN) can reduce previously good rates to a trickle


Issues / Lessons

• Needed a better recipe:
– With limited opportunities for site admins to try out the software, a better recipe for preparing and running the test would have helped
• Email communication wasn't always ideal:
– Would have been better to get phone numbers for all the site contacts
• Ganglia bandwidth plots seem to underestimate the rate


What worked

• The tests themselves, despite the above
• Community support:
– Reference sites got early experience running the tests and could help the early sites, who in turn could help the next wave, and so on
• Service reliability:
– The FTS was much more reliable than in previous tests
– Some problems with the myproxy service stopping, causing transfers to stop
• Sites owning the tests


Where are we now?

• 14 out of 19 sites have participated, and have successfully completed 21 out of 38 tests

• >60 TB of data has been transferred between sites

• Max recorded transfer rate: 330 Mb/s
• Min recorded transfer rate: 27 Mb/s


Now


RAL Castor

• During the latter part of the tests the new CASTOR instance at RAL was ready for testing
• We had a large pool of sites that had already been tested, with admins familiar with the test software, who could quickly run the same tests with the new CASTOR as the endpoint
• This enabled us to run tests against CASTOR and get good results whilst still running the main tests
• This in turn helped the CASTOR team in their superhuman efforts to get CASTOR ready for CMS's CSA06 tests


Conclusions

• UK Tier Twos have started to prepare for the data challenges that LHC running will bring

• Network “weather” is variable and can have a big effect

• As can any one of the upstream network providers


Future

• Work with the sites with low rates to understand and correct them
• Keep running tests like this regularly:
– Sites that can do 250 Mb/s now should be doing 500 Mb/s by next spring and 1 Gb/s by this time next year (see the sketch below for the daily volumes these rates imply)
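For scale, a small sketch of the daily data volume each milestone rate implies for a site running flat out:

    # Data moved per day of sustained transfers at each target rate
    for label, mbps in [("now", 250), ("next spring", 500), ("next year", 1000)]:
        tb_per_day = mbps * 1e6 * 86400 / 8 / 1e12   # Mb/s -> TB/day
        print("%-12s %4d Mb/s  ~%4.1f TB/day" % (label, mbps, tb_per_day))

That is roughly 2.7, 5.4 and 10.8 TB/day respectively.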


Thanks…

• Most of the actual work for this was done by Jamie, who co-ordinated everything; the sysadmins, Grieg, Mark, Yves, Pete, Winnie, Graham, Olivier, Alessandra and Santanu, who ran the tests; and Matt, who kept the central services running.