University Library.

Report to the University of Sheffield Research Data Management Service Delivery Group

Briefing Paper on RDM Technical Infrastructure

Date: 05/11/2014

Author: John A. Lewis

Briefing Paper on RDM Technical Infrastructure

The provision of a technical infrastructure for RDM is intended to satisfy the researchers’ RDM needs

and to make the work of the researcher and institution easier. An appropriately designed technical

infrastructure will help researchers to achieve good research practice simply by utilising the facilities

(without thinking about RDM), to fulfil their obligations to the institution and funders without extra

work, and to have all processes involved work together seamlessly.

The Research Data Life-cycle

This is a means of visualising the flow of research data and the processes involved, during a generic

research project (see fig. 1). At various points during the Research Data Lifecycle, provision of

appropriate technical infrastructure may greatly benefit researchers’ workflows:

1. Research Planning

It is widely accepted that RDM represents good research practice, and as such the creation of a Data

Management Plan (DMP) offers the researcher an opportunity to determine the most appropriate

data management procedures to apply during and after a research project. Creating a DMP will help

the researcher avoid errors, particularly research data loss disasters, and time-consuming

retrospective data management. A DMP will benefit the overall planning of the research project.

2. Active Research Data Management

‘Active’, or ‘live’, data are the research data created or collected, and processed or derived during the

active phase of the research project. Once underway, a research project will collect or create raw

data, which, during the project, will usually be processed to create derived or processed data. There

may be many different iterations of processing, resulting in many sets of derived data and,

eventually, a set of ‘results’ data selected as the basis of the research publication(s) output by the

project. All these sets of data can be considered active data, which need to be quickly accessible

and easily shared between collaborators, subject to stringent security arrangements.

3. Data Documentation

In order to identify, organise, find and retrieve active data, appropriate documentation is essential.

Data documentation also indicates the conditions and processes involved in the creation or

collection of the data, the processing of the data and the context of the research. Detailed

documentation is essential for verification and reuse. Metadata are a highly structured subset of

core data documentation. Metadata are structured so that they may be indexed and stored within a

database, thereby facilitating data organisation and discovery, and machine to machine

interoperability. Metadata are necessary to determine provenance, licensing and access

arrangements, preservation requirements and for discovery and citation.

Ideally, the ‘metadata capture’ process (‘data cataloguing’) should be automated where possible to

reduce the amount of manual annotation required of researchers. Automation reduces ‘double-keying’,

which is frustrating for researchers, and also reduces the number of errors inevitably introduced

through manual input.
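
As a minimal sketch of automated metadata capture, assuming only that data are saved as ordinary files, the following Python function derives basic technical metadata from the file system so that it need not be keyed in by hand. The function name and field names are illustrative assumptions and are not drawn from any particular system or metadata standard.

```python
import hashlib
import os
from datetime import datetime, timezone


def capture_file_metadata(path):
    """Return basic technical metadata for one file, with no manual input.

    The field names are hypothetical; a real service would map them onto
    whatever metadata schema the institution adopts.
    """
    stat = os.stat(path)

    # Read the file in chunks so that large datasets do not exhaust memory.
    sha256 = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            sha256.update(chunk)

    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "file_name": os.path.basename(path),
        "size_bytes": stat.st_size,
        "modified_utc": modified.isoformat(),
        "sha256": sha256.hexdigest(),
    }
```

Descriptive metadata, such as the research context or the conditions of data collection, would still need to be supplied by the researcher or drawn from other systems.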

4. Data Selection

Because it is not feasible to preserve all the research data produced during a project, for reasons of

cost and discoverability, a process of data selection / appraisal needs to be carried out.

Preservation of some research data may be a condition of funding. The responsibility for this process

will lie with a Data Librarian or Archivist. Data may need ‘cleaning’ by editing corrupt or incorrect

elements to ensure integrity. Data not selected for preservation will need deletion in an appropriate

manner.

5. Data Archiving

Data selected for preservation will usually be ingested into a Digital Repository where they may be

actively preserved (or curated), ensuring they remain immutable (must never change), but accessible

(in usable formats) in the long-term (beyond 10 years). ‘Read only’ access is required and slower

access time will be acceptable. There may be no request for access for long periods of time, if ever. A

copy of the dataset may be held in a local cache for quick access, otherwise an access copy will be

requested.

6. Data Publishing

Datasets are published by making the associated metadata records available for discovery through a

catalogue and making the files available for download from appropriate storage. Whereas the

metadata will probably be openly accessible, there may be access conditions in place to control

access to the data themselves. Further processing of the data, such as anonymisation or redaction,

may be necessary to make them publicly accessible.

7. Data Discovery and Reuse

In many disciplines, Research Data may now be considered a primary research output, to be

discovered, reused, cited and achieve impact. Such published data may be processed and

manipulated in different ways to those of the original creator / collector, or combined with other

datasets to derive further results and conclusions. Thus the funder achieves more return on their

original funding, the researcher achieves more impact from their research and researchers will

waste less time and effort duplicating research.

Institutional RDM Policy & Funder Requirements

In line with the RCUK common principles on data policy [1], the EPSRC Expectations [2] of organisations

receiving EPSRC funding include the requirements that the organisation will:

Publish appropriately structured metadata describing the research data they hold –

therefore the institution must create a public data catalogue.

Ensure that EPSRC-funded data is securely preserved for a minimum of ten years – therefore

the institution must create a data archive.

Ensure that effective data curation is provided throughout the full data lifecycle – therefore

the institution must provide the necessary human and technical infrastructure required.

Institutions in receipt of EPSRC funding are expected to be compliant with these expectations by 1st

May 2015. The University of Sheffield Research Data Management Policy [3] was developed in

response to the RCUK principles and EPSRC expectations. This states that the University will develop

infrastructure and services to support research data management in consultation with researchers.
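
The ten-year minimum preservation period can be illustrated with a short date calculation. This is a sketch only: the function names are hypothetical, and the date from which the period runs is taken as an input, because the funder's expectations, not this code, determine when the period begins.

```python
from datetime import date


def retention_end(start, years=10):
    """Earliest date on which the minimum retention period has elapsed.

    `start` is an assumed input; consult the funder's policy for the
    event that actually starts the period.
    """
    try:
        return start.replace(year=start.year + years)
    except ValueError:
        # `start` is 29 February and the target year is not a leap year.
        # Roll forward to 1 March so the period is never shorter than required.
        return date(start.year + years, 3, 1)


def may_review_for_deletion(start, today, years=10):
    """True once the minimum retention period has fully elapsed."""
    return today >= retention_end(start, years)
```

For example, `retention_end(date(2015, 5, 1))` returns `date(2025, 5, 1)`. Reaching that date only makes a dataset eligible for review; it does not imply that the data should be deleted.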

Functional Components of the Technical Infrastructure

Many implementations of RDM technical infrastructure have involved an overlap in the functions

provided by the different components (see fig. 2). Many components have been designed around the

functional requirements determined by researcher workflows; however, most implementations have

had to take into account existing systems, and modifications have been engineered to ensure

interoperability. The major functional requirement of every component is interoperability –

achieved chiefly by adherence to data and metadata standards.

1. CRIS

Depending upon the particular system and configuration, the Current Research Information System

holds information about the researcher, project, grant and funder. The CRIS will be interoperable

with the HR system, Grants / Awards management system and possibly other systems such as

research costing, facilities management and the institutional CMS. The CRIS provides a register

(inward facing catalogue) of the researcher’s published outputs and may act as a means to deposit

publications into the Institutional Repository if interoperable. The CRIS may feasibly be used to push

metadata-only records of research data to the Institutional Repository or, in itself, function as a

catalogue of the institution’s published research data outputs. The ‘Pure’ CRIS offers a public-facing

catalogue facility, the ‘Pure Portal’, which can function as an Institutional Repository.

2. Data Management Planning Tools

These are facilities that help researchers compile the DMP that some funding organisations require

to be submitted as part of the grant application process. Such tools (for example, the DCC’s DMPonline

tool) provide templates for the different major funders and may be customised for the institution.

The tools may be accessed via institutional login, and the DMPs created may be stored in the institutional

CRIS, in the Research Management System or by a DMP service.

3. Data and Metadata Capture

Metadata capture (or data cataloguing) may be accomplished simply by providing an interface for

researchers to fill out online forms. Automatic metadata capture, concurrent with data capture, may

be facilitated by using appropriate instruments and equipment that save data to the laboratory,

departmental or facility file store or the institutional network. Electronic lab books, electron

microscopes and other imaging instruments, and genetic sequencing and analysis instruments usually

have attached local storage or may feed data to a project-based Laboratory Information

Management System (LIMS). Ideally, data and metadata need to be transferred to central active

data storage.

4. Active Data Storage (including HPC Grid storage)

Active research data need to be rapidly accessed and easily shared between collaborators, with access

being controlled through stringent security arrangements. It may be necessary to differentiate

between data requiring ‘Read-Write’ and data requiring ‘Read-only’ access, to provide cost-effective

storage. Working datasets, which change constantly (as they are being created, added to, processed

and edited), will require Read-Write access, frequent back-up and may require large computational

resources. However, much ‘active data’ will be immutable and require only Read-only access, and

may therefore be moved to ‘Archive data’ storage.
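
One way to picture the Read-Write / Read-only distinction is a rule that flags datasets which have stopped changing. The sketch below is an illustration under stated assumptions: the `Dataset` record, the function name and the 90-day idle threshold are all hypothetical, and a real service would set its own criteria.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Dataset:
    name: str
    last_modified: date


def suggest_access_mode(dataset, today, idle_days=90):
    """Suggest 'read-only' for datasets unchanged for `idle_days` or more.

    The threshold is an assumption made for this sketch, not a figure
    taken from any policy.
    """
    idle = today - dataset.last_modified
    if idle >= timedelta(days=idle_days):
        return "read-only"
    return "read-write"
```

A raw dataset last modified in January would be suggested as read-only by November, and so becomes a candidate for cheaper archive-style storage, whereas a working dataset edited last week would remain read-write.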

5. Active Data Management / Collaboration Management

At the University of Sheffield, collaborative computing is provided through the shared

(departmental) networked drives, the HPC Grid service and Google drive (Google Apps for Education

cloud storage). Collaborative computing systems, such as HPC Grids, Dropbox-like cloud storage

services, Virtual Research Environments (VRE) and Laboratory Information Management Systems

(LIMS), have been developed to accommodate the need for secure, read-write access to shared

storage. Active data management systems may be considered to comprise three functional

components: a storage layer, a data registry (or metadata store or asset registry) database layer and

a user interface layer. In some cases these components will be integrated into a single system; in

other cases, the metadata may be handled by the CRIS and storage by the institutional network.

6. Research Data Selection / Deposit Facility

The institution may provide a service to help researchers appraise their data, assess the preservation

requirements, help with submission to the institutional repository and help with submission to

external repositories.

7. Archive Data Storage and Digital Preservation

Many archive data storage arrangements distinguish between permanent archival storage, perhaps

through an external service, and operational storage on a local server, holding ingested files for

processing and access copies of the data. This distinction is due to the slow access speed and higher

cost of retrieval from archival storage. Control of the system will be mediated by a storage resource

broker.

Data selected for long-term preservation will require storage that ensures the file remains

immutable. Such ‘bit preservation’ requires constant management (regular checksum verification) and back-up

to a variety of media including tape storage and off-site or cloud storage. A number of vendors offer

a digital archiving service (Arkivum, AWS); because of the high costs involved in retrieval, such

archival storage services are most appropriate for back-up copies of ‘canonical’ data. Digital

preservation involves ensuring that the material will remain accessible in perpetuity (through format

migration), as well as ensuring data immutability through ‘bit preservation’.
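
The regular checksum verification mentioned above can be sketched as follows. This is a minimal illustration, assuming that a SHA-256 checksum was recorded when the file was ingested and that each back-up copy is reachable as a file path; the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            digest.update(chunk)
    return digest.hexdigest()


def failed_copies(expected_sha256, copy_paths):
    """Return the paths whose current checksum differs from the recorded one.

    An empty list means every copy still matches the checksum recorded
    at ingest, so the bits are unchanged.
    """
    expected = expected_sha256.lower()
    return [path for path in copy_paths if sha256_of(path) != expected]
```

A copy that fails the check would typically be replaced from one that passes, which is one reason for holding copies on a variety of media.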

A Research Data Archive preserves data not, or not yet, submitted to discipline-based data

repositories. The associated metadata records are held in the research data registry or catalogue.

8. Research Data Registry

The Registry is defined here as an inward-facing catalogue that holds the metadata records of

unpublished research data. The data themselves will be held in the institutional data archive. The

data and metadata may be eventually published by ingest into a discipline-based data repository

outside the institution or into the Institutional Data Repository. The CRIS may function as a research

data registry, providing researchers with an interface to record metadata in order to register a

dataset.

9. Research Data Repository

A repository may be defined as a Digital Asset Management System (DAMS), consisting of three

layers: a storage layer back-end (which may be a hybrid of local, external and cloud storage); a ‘metadata

store’ database layer; and a user interface or access platform front-end. A wide range of repository

systems is available, from proprietary, externally managed services to open source software systems.

Many have been designed to manage specific media, or to provide functions specific to digital

preservation or research data management. A Digital Preservation System is a DAMS / repository

that manages the active preservation of the content as well as the storage and metadata

management and access interface.

Archive data are possibly best managed in a discipline-based repository or data centre, whilst the

Institutional Repository may be considered the ‘repository of last resort’. The Institutional Research

Data Repository (or the institutional repository, if it has been modified to accommodate data) is an

appropriate home for datasets for which there is no discipline based repository or data centre, or for

temporary storage before being submitted to a data centre. The Institutional Research Data

Repository may provide a public catalogue for all published data created at the institution, covering data

held by data centres / discipline-based repositories as well as data held by the institution.

The catalogue and archive storage functions of the repository may be separated. In such an

arrangement, the archive storage function may be achieved using an external service or in an

institutional data archive, with access and deposit managed seamlessly through the repository

platform. Where the repository holds only metadata records, it may be considered a Research Data

Catalogue.

10. Research Data Catalogue

The Catalogue is defined here as a publicly-accessible catalogue, holding the metadata records of

published research data. The data themselves may be held in a discipline-based data repository

outside the institution or in an institutional data archive. The selection of the underlying metadata

schema is fundamental and consideration must be given to the schema used by the proposed

National Data Registry. Many institutions favour the DataCite metadata schema [4], subscription to

which provides the means to mint DOIs and assurance of a standard level of preservation. The

Research Data catalogue may be provided by a number of repository platforms, access platforms or

catalogue software systems.
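
To illustrate how a catalogue might check records before exposing them publicly, the sketch below tests for five properties that early versions of the DataCite schema treat as mandatory (identifier, creator, title, publisher and publication year). The dictionary keys, the `published` flag, the function names and the example DOI are hypothetical; later schema versions add further mandatory properties, so the current schema documentation should be checked.

```python
# Hypothetical internal field names for five DataCite mandatory properties.
REQUIRED = ("identifier", "creators", "title", "publisher", "publication_year")


def missing_properties(record):
    """List the required properties that are absent or empty in a record."""
    return [key for key in REQUIRED if not record.get(key)]


def visible_in_catalogue(record):
    """A record appears in the public catalogue only if it is marked as
    published and carries all the required properties."""
    return bool(record.get("published")) and not missing_properties(record)


example = {
    "identifier": "10.1234/example-dataset",  # made-up DOI for illustration
    "creators": ["A. Researcher"],
    "title": "Example dataset",
    "publisher": "Example University",
    "publication_year": 2014,
    "published": True,
}
```

Under this arrangement an unpublished record would stay in the inward-facing registry, and setting the `published` flag is what moves it into the public catalogue.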

John A. Lewis 05/11/14

[1] RCUK common principles on data policy http://www.rcuk.ac.uk/research/datapolicy/

[2] EPSRC Expectations http://www.epsrc.ac.uk/about/standards/researchdata/Pages/expectations.aspx

[3] University of Sheffield Research Data Management Policy http://www.shef.ac.uk/ris/other/gov-ethics/grippolicy/practices/all/rdmpolicy

[4] DataCite metadata schema http://schema.datacite.org/

Fig. 1 Some representations of the Research Data Lifecycle

[Figure not reproduced. Stages labelled: Planning Data Management; Organising & Documenting Your Data; Storing & Securing Your Data; Preserving & Sharing Your Data. Image sources: http://ukdataservice.ac.uk/media/132177/data_lifecycle_recolour.png and http://datalib.edina.ac.uk/mantra/datamanagementplans/media/RDMcycle.png]

Fig. 2 RDM Technical Infrastructure Dataflow

[Diagram not reproduced. Components labelled: Researcher; HR & Research Management; CRIS; DMP Service; Instrument (Data Capture); Active Data Storage; Active Data Management System; Active Data Registry; Institutional Research Data Registry; Research Data Archive (Digital Preservation); Archive Data Storage (Digital Preservation); Archive Data Storage (for Public Access); Institutional Research Data Repository; Institutional Research Data Catalogue; Data Centre / Disciplinary Repository. Flows labelled: Research Data Flow; Metadata Flow; DM Plans; Researcher & Project metadata; File metadata; Manual Data documentation; Automatic Metadata Capture; Dataset Deposit at project end; Dataset Metadata & Description at project end (Manual Upload?); Dataset Metadata & Description on Dataset Publication; Dataset Preservation; Dataset Publication; Dataset Access & Reuse. Protocols labelled: SWORD2; OAI-PMH. The diagram also marks the boundaries Private / Public and Institution / Virtual Org / External.]