GROWING OPEN DATA: MAKING THE SHARING OF XXL-SIZED RESEARCH DATA FILES ONLINE A REALITY, USING EDINBURGH DATASHARE
PAULINE WARD: [email protected] @PAULINEDATAWARD
GEORGE HAMILTON
THE CHALLENGE
• Researchers are generating bigger files. At the University of Edinburgh all researchers are entitled to 500 GB of storage.
THE CHALLENGE
• Researchers need to be able to share their data online.
• For impact.
• For discoverability.
• For reproducibility.
• For compliance.
THE CHALLENGE
• DataShare is the Institutional Repository for research data for staff and students at the University of Edinburgh: datashare.is.ed.ac.uk
• Previous file size limit of 2.1 GB.
• Largest file we’ve been asked to share: 20 GB – split into smaller files.
• Largest fileset we’ve been asked to share: 226 GB – split into smaller filesets.
THE CHALLENGE
• Some files had to be imported via a time-consuming batch import process because they were too big or too numerous for web deposit.
• Some files are still waiting to be shared because they are too big for users to be able to conveniently download them.
• These files are generated from a wide range of disciplines and a wide range of methods.
THE SOLUTION
• Getting the files from the depositors: address upload
• Allowing users to get the files: address download
THE SOLUTION: UPLOAD
• HTML5 resumable upload
THE SOLUTION: UPLOAD
• EDINA’s code for implementing HTML5 upload in DSpace is on GitHub: https://github.com/edina/DSpace/tree/xml-html5-upload
• Uses resumable.js
• This was the XMLUI re-write of functionality that was available for DSpace 5.0 JSPUI. See https://jira.duraspace.org/browse/DS-1562 for further details.
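To illustrate what a resumable upload asks of the server, here is a minimal Python sketch of chunk tracking and reassembly. It is not EDINA's DSpace code (which is Java); the class name `ChunkedUpload` and its methods are hypothetical. It assumes the client numbers chunks from 1, as resumable.js does, and that the upload identifier has already been sanitised for use in a file name.

```python
import os


class ChunkedUpload:
    """Tracks the chunks of one uploaded file and assembles them at the end."""

    def __init__(self, identifier, total_chunks, work_dir):
        self.identifier = identifier      # assumed safe to use in a file name
        self.total_chunks = total_chunks
        self.work_dir = work_dir
        os.makedirs(work_dir, exist_ok=True)

    def _chunk_path(self, number):
        return os.path.join(self.work_dir, f"{self.identifier}.part{number}")

    def has_chunk(self, number):
        # A resuming client can check this before re-sending a chunk.
        return os.path.exists(self._chunk_path(number))

    def save_chunk(self, number, data):
        # Write under a temporary name first, so an interrupted request
        # never leaves a half-written chunk that looks complete.
        tmp = self._chunk_path(number) + ".tmp"
        with open(tmp, "wb") as fh:
            fh.write(data)
        os.replace(tmp, self._chunk_path(number))

    def missing_chunks(self):
        return [n for n in range(1, self.total_chunks + 1)
                if not self.has_chunk(n)]

    def assemble(self, dest_path):
        # Concatenate the chunks in order once every one has arrived.
        if self.missing_chunks():
            raise ValueError("upload incomplete")
        with open(dest_path, "wb") as out:
            for n in range(1, self.total_chunks + 1):
                with open(self._chunk_path(n), "rb") as part:
                    out.write(part.read())
        return dest_path
```

Because each chunk is stored independently, an upload interrupted by a dropped connection only needs the chunks listed by `missing_chunks()` to be sent again, rather than the whole file.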
THE SOLUTION: UPLOAD
• Testing shows files up to 15 GB upload successfully.
• (cf. figshare 5 GB file size limit, Zenodo 2 GB)
• A 20 GB file upload has been done in testing, but it generates an error message in the browser, and the user must find and Resume the submission from the Submissions page.
• Multiple files can be uploaded by drag’n’drop.
THE SOLUTION: DOWNLOAD
We wanted a mechanism, which DSpace doesn’t provide, of zipping up files for download.
• BitTorrent was one possible approach: could be added at a later date
• Other approaches possible (Rsync, Secure Copy (SCP))
THE SOLUTION: DOWNLOAD
• FTP download: agreed
• Tried and tested technology that we are confident we can put in place and will work well
• All files will be accessed from the FTP server anonymously
• Users can still download files via browser via FTP
• Users who wish can use an FTP client, allowing them to resume a download
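The resume behaviour mentioned in the last bullet can be sketched in Python with the standard library's `ftplib`. This is an illustration only: the function name `resume_download` is hypothetical, `ftp` is assumed to be a connected and logged-in `ftplib.FTP` object, and the server is assumed to support the FTP `REST` command.

```python
import os


def resume_download(ftp, remote_name, local_path, blocksize=64 * 1024):
    """Download remote_name, continuing from any partial local copy."""
    # Bytes already on disk from an earlier, interrupted attempt.
    offset = os.path.getsize(local_path) if os.path.exists(local_path) else 0
    with open(local_path, "ab") as fh:
        # Passing rest=offset makes ftplib send REST before RETR, so the
        # server starts sending from that byte position.
        ftp.retrbinary(f"RETR {remote_name}", fh.write, blocksize,
                       rest=offset or None)
    return os.path.getsize(local_path)
```

With an anonymous connection opened via `ftplib.FTP(host)` followed by `login()`, calling `resume_download(ftp, "fileset.zip", "fileset.zip")` a second time after a failure would fetch only the remaining bytes; the host and file name here are placeholders.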
THE SOLUTION: DOWNLOAD
• Specification:
• All files will still be required to have appropriate metadata stored in DSpace
• All filesets will now be downloadable as a zip file (previous 5.2 GB limit)
• Move DSpace assetstore to a location where more storage is available
• Statistics (i.e. numbers) of file downloads by SFTP will be added to DSpace statistics
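As a rough illustration of preparing one zip file per fileset ahead of time, the following Python sketch writes a set of files into a single archive on disk. It is not the DataShare implementation; the function name `build_fileset_zip` is hypothetical, and storing files uncompressed is an assumption made here to keep CPU use low.

```python
import os
import zipfile


def build_fileset_zip(file_paths, zip_path):
    """Write all files of one fileset into a single zip archive on disk."""
    # ZIP_STORED skips compression, so building the archive is mostly I/O.
    # allowZip64 lets the archive and its members exceed 4 GiB.
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED,
                         allowZip64=True) as zf:
        for path in file_paths:
            # Store each file under its base name, without directories.
            zf.write(path, arcname=os.path.basename(path))
    return zip_path
```

An archive built once in this way can then be served as an ordinary static file, which is what makes resumable download by an FTP client possible.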
THE SOLUTION: DOWNLOAD
• This is a replacement for our current on-the-fly zip file creation of Item bitstreams.
• Will mitigate potential performance issues, because it will use fewer server resources (Java threads and RAM).
SUMMARY
• We have implemented HTML5 upload in the DataShare (DSpace) web interface to allow depositors to easily and quickly deposit individual files up to 15 GB.
• We are working on integrating an SFTP server to allow users to retrieve filesets larger than our current 20 GB limit. Storage rather than network/browser timeout will become the limiting factor on fileset size. We anticipate making numerous filesets around 100 GB available in this way in the medium term.