Research Data Management: the institutional and national challenges Simon Hodson JISC Programme Manager, Managing Research Data Wednesday 23 May 2012 Approaches to Research Data Management, UWE, Bristol

Deluges and inundations…

A Surfboard for Riding the Wave, a four country action programme on research data

http://bit.ly/KE_Surfboard

Hey, Trefethen, ‘The Data Deluge: an e-Science Perspective’ (2003):

http://eprints.soton.ac.uk/257648/

Riding the Wave:

http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf

PDB

GenBank

UniProt

Pfam

Spreadsheets, Notebooks

Local, Lost

High throughput experimental methods

Industrial scale

Commons based production

Publicly available data sets

Preserved

CATH, SCOP (Protein Structure Classification)

ChemSpider

Slide Credit: Carole Goble, Liz Lyon

Volume: the long tail…

Estimated Research Data Requirements

Two Russell Group Universities

Estimated current data holdings of c.2PB (managed and unmanaged)

Currently provide 800TB/300TB in a central storage facility, not all of which is used (but will be full in 12-18 months)…

Significant amount of data in temporary storage, external drives etc…

‘the more groups we go to talk to, the more we're hearing of significant data holdings on external hard drives and small RAID systems’

1994 Group University

No central research data provision.

Faculties (medicine, business, humanities) have 20-30TB each.

Engineering currently has 170TB faculty system, urgent need to expand.

But… one group, recently interviewed, currently has 250TB, only half in ‘managed storage’; will reach PB levels in the next few years.

DUDs

The data centre under the desk (or in a back pack) is not adequate.

Evidence that significant data loss occurs…

‘Departments typically don’t have guidelines or norms for personal back-up and researcher procedure, knowledge and diligence varies tremendously. Many have experienced moderate to catastrophic data loss.’

– Incremental Project Scoping Study and Implementation Plan

http://www.lib.cam.ac.uk/preservation/incremental/documents/Incremental_Scoping_Report_170910.pdf

‘The current environment is such that responsibility for good data management is devolved to individual researchers and in practice PIs set the 'rules' and establish the cultural practices of the research groups and this means there is good data management practice going on in pockets but no consistency across groups. There is also consequently a high risk of data losses by a number of means.’

– MaDAM Project Requirements Analysis

http://www.merc.ac.uk/sites/default/files/MaDAM_Requirements%20_%20gap%20analysis-v1.4-FINAL.pdf

Can we quantify the benefits of reducing data loss?

JISCMRD Project Survey (interim analysis)

262 respondents.

23.3% of respondents (61) have lost research data

– One respondent had lost all their research data as it had not been backed up.

– 20 had lost one week’s work

– 23 had lost one day’s work
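The headline percentage follows directly from the counts quoted above. A minimal Python sketch of that arithmetic, assuming only the figures on the slide (the variable and function names are illustrative and not taken from the survey):

```python
# Counts as quoted in the interim survey analysis above.
RESPONDENTS = 262
LOST_DATA = 61
LOST_ONE_WEEK = 20
LOST_ONE_DAY = 23


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


if __name__ == "__main__":
    # Proportion of all respondents who reported losing research data.
    print(share(LOST_DATA, RESPONDENTS))
    # Proportion of those who lost data whose loss was about a week's work.
    print(share(LOST_ONE_WEEK, LOST_DATA))
```

The same helper can be reused for any other breakdown once the final (non-interim) survey figures are available.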

Why manage research data?

Not just about storage or avoiding data loss…!

It’s about knowing what to keep and what to throw away…

Important to extract maximum return on investment from publicly funded research.

Access to underlying data is essential for verification and therefore research integrity.

Opportunities to extract more knowledge from existing data, new analysis: new research questions, data integration, meta studies.

It’s about making the most out of data created!

Finally, a lamentable element of the culture in social psychology and psychology research is for everyone to keep their own data and not make them available to a public archive. This is a problem on a much larger scale, as has recently become apparent. Even where a journal demands data accessibility, authors usually do not comply (Wicherts et al. 2006). Archiving and public access to research data not only makes this kind of data fabrication more visible, it is also a condition for worthwhile replication and meta-analysis.

Recommendation

Far more than is customary in psychology research practice, research replication must be made part of the basic instruments of the discipline. Research data that underlie psychology publications must be held on file for at least five years after publication, and be made available on request to other scientific practitioners. This rule is to apply not only to raw laboratory data, but also to completed questionnaires, audio and video recordings, etc. The publication must state where the raw data reside and how to access them.

– Interim Report Regarding the Breach of Scientific Integrity Committed by Prof. D.A. Stapel, Tilburg, 31 October 2011

Benefits of data management and sharing

Papers based upon reuse of archived observations now exceed those based on the use described in the original proposal.

– http://archive.stsci.edu/hst/bibliography/pubstat.html

Research Data Challenges

Challenges: the ‘data deluge’… huge quantities of digital data

– But it’s not just about addressing storage issues.

Opportunities: data reuse, meta-studies, interdisciplinary grand challenges.

– Increasing awareness of research data as an asset.

– Digital research data has reuse value - important to obtain full return on public investment.

Results in policy drivers from funders.

– Need improved knowledge of how best to realise these policies.

Increasing emphasis on the role of universities and research institutions to provide infrastructure and support for RDM.

Drivers: Research Funder Policies

Legislative responsibilities and good practice: FoI, UK Research Integrity Office.

Most funders require applicants to submit data management and sharing plans at grant proposal stage.

– ESRC require a plan to be submitted electronically with the grant

– NERC will require a DMP and is introducing the notion of a ‘data value checklist’

EPSRC places responsibility on institutions to develop a data policy, supporting services and roadmap

Increasing responsibility being placed on universities; policies increasingly prescriptive.

– Summary of UK Funders’ Data Policies: http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies

– Sarah Jones, ‘Developments in Research Funder Data Policy’, International Journal of Digital Curation (2012), 7(1), 114–125; http://dx.doi.org/10.2218/ijdc.v7i1.219

EPSRC Research Data Policy Expectations

Research organisations to have RDM policy, advocacy and support functions. (i, iii)

Research data to be effectively managed and curated throughout the life-cycle (viii)

Research organisations to maintain public catalogue of research data holdings, adequate metadata and permanent identifier (v)

Publications to indicate how research data can be accessed (ii)

Data to be retained for 10 years from last access (vii)

Research data management to be adequately resourced from appropriate funding streams (ix)

Roadmap in place by 1 May 2012

Compliance by 1 May 2015
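The retention expectation (vii) can be read as a simple date rule. The following Python sketch assumes the slide's summary wording ("10 years from last access") and treats it as a plain calendar calculation; the function names are hypothetical, and the full policy text may define the start of the period more precisely than this.

```python
from datetime import date

# Assumption: the retention period as summarised on the slide (expectation vii).
RETENTION_YEARS = 10


def retention_end(last_access: date, years: int = RETENTION_YEARS) -> date:
    """Return the date on which the retention period ends."""
    try:
        return last_access.replace(year=last_access.year + years)
    except ValueError:
        # last_access was 29 February and the target year is not a leap year:
        # fall back to 28 February of the target year.
        return last_access.replace(year=last_access.year + years, day=28)


def retention_elapsed(last_access: date, today: date) -> bool:
    """True once the retention period counted from last access has passed."""
    return today > retention_end(last_access)


if __name__ == "__main__":
    print(retention_end(date(2012, 5, 23)))
```

Because the clock restarts at each access, a catalogue implementing this rule would need to record the last access date for every dataset, not only its deposit date.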

University Mission

Providing an excellent infrastructure for research is central to a university mission

– Research data will be managed to the highest standards throughout the research data lifecycle as part of the University’s commitment to research excellence. Edinburgh Research Data Policy

– The University of East London recognises that good research demands good data management in the support of academic integrity, openness and good stewardship. It will ensure that research data is managed to high standards throughout the research data lifecycle as part of its commitment to academic excellence. This policy will ensure UEL is in accordance with Research Councils UK’s Common Principles on Data Policy as well as the specific requirements of the Engineering and Physical Sciences Research Council Research Data Management Policy.

Universities want to have better oversight of research outputs; data, like publications, are a reputational asset.

– ‘Sharing research data is an important contributor to the impact of publicly funded research.’ EPSRC Research Data Policy

Research Data Management: University of Edinburgh Roadmap

Research Integrity, London - Sept 2011

Level: PhD student; individual researcher; university research team; supra-university

– Where do I safely keep my data from my fieldwork, as I travel home?

– How can I best keep years’ worth of research data secure and accessible for when I and others need to re-use it?

– How do we ensure compliance with funders’ requirements for several years of open access to data?

– How do we ensure we have access to our research data after some of the team have left?

– How can our research collaborations share data, and make them available once complete?

Seeking win + win + win + win + win……

Cost-benefits and efficiencies

Benefits and savings through centralised institutional infrastructures.

– 37% projected saving in staff time and infrastructure costs from moving Oxford Roman Economy Project database to centralised virtual service.

More efficient retrieval of data through more effective RDM systems.

– One-day delay cut to 5 minutes: Estimated time saving for crystallography researchers to access results from Diamond synchrotron, by deploying digital processing pipeline & metadata capture system.

Making the Case for RDM, DCC Briefing Paper:

http://www.dcc.ac.uk/resources/briefing-papers/making-case-rdm

Report on the Benefits from the Infrastructure Projects in the JISC Managing Research Data Programme:

http://www.jisc.ac.uk/whatwedo/programmes/mrd/outputs/benefitsreport.aspx

Further evidence emerging from the University Modernisation Fund RDM Projects and from the new Managing Research Data Programme.

Why is managing research data important?

JISC considers it a priority to support universities in improving the way research data is managed and, where appropriate, made available for reuse.

Research funder policies, legislative frameworks, good practice, open data agenda

– The outputs of publicly funded research should be publicly available.

– The evidence underpinning research findings should be available for validation

Good data management is good for research

– More efficient research process, avoidance of data loss, benefits of data reuse

Alignment with university missions.

– Universities want to provide excellent research infrastructure.

– Universities want to have better oversight of research outputs.

Supporting the Research Data Lifecycle

Lifecycle stages shown: Plan, Create, Use, Appraise, Publish, Discover, Reuse, Store, Annotate, Select, Discard, Describe, Identify, Hand Over?, Access

Leadership and Policy Development

Guidance and Training

Support for Data Management Planning

RDM Systems and Infrastructure

Publication, Citation and Discovery Mechanisms

First Managing Research Data Programme, 2009-11

First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11

JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs

RDM Infrastructure (guidance/support, systems)

RDM Planning (DMPs, best practice, disciplinary challenges)

RDM Training (targeted at disciplinary needs)

Challenges of data citation and publication

Second Managing Research Data Programme, 2011-13

Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11

Grant funding call for projects encouraging research data publication and developing RDM training materials: http://bit.ly/jiscmrd-2012-Call

RDM Infrastructure (policy, guidance/support, systems)

17 large projects

RDM Planning (DMPs, best practice, disciplinary challenges)

RDM Training (disciplines and librarians)

Innovative data publication

Institutional RDM Services

Institutional RDM Policy sets the tone, aspirations, lays out roles and responsibilities.

Guidance and training for research staff and support staff.

Support for data management planning.

Research data management infrastructure:

– Systems, procedures and support for managing data during the project lifetime.

– Criteria for selection and retention…

– Archival / repository system for published data with research data catalogue / metadata store.

Interoperation with institutional administrative / research management systems.

How to develop RDM services

In development!

Why develop services?

Roles and responsibilities

Process of service development

The components / building blocks:

– Policy

– Data Management Planning

– Storage

– Data registry…

Getting started

Examples and case studies to develop into toolkit

Slide Credit: Sarah Jones and Martin Donnelly, DCC

Thank You!

First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11

JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs

Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11

Programme Blog: http://researchdata.jiscinvolve.org/

E-mail: [email protected]

Acknowledgements for slides, materials: Carole Goble, Liz Lyon, Peter Murray-Rust, David Shotton, Jeff Haywood, Sarah Jones, Martin Donnelly

Leadership and policy development

What should an institutional RDM policy look like?

Institutional RDM policy ‘sets the tone’, lays out commitment, expectations, roles and responsibilities.

– High level and aspirational?

– Business processes and responsibilities?

– Relation of RDM policies to other policies and procedures?

– Relation of policy and implementation?

DCC on institutional data policies (six published, five in draft):

– http://www.dcc.ac.uk/resources/policy-and-legal/institutional-data-policies

Recent JISCMRD / DCC Workshop:

– http://researchdata.jiscinvolve.org/wp/2012/03/27/developing-institutional-research-data-management-policies/

17 JISCMRD projects and 18 DCC institutions developing institutional RDM policies.

National exchange of best practice through JISC and DCC.

JISCMRD Training Projects

Need for subject-focussed research data management / curation training, integrated with PG studies

Five projects to design and pilot (reusable) discipline-focussed training units for postgraduate courses: http://www.jisc.ac.uk/whatwedo/programmes/mrd/rdmtrain.aspx

Health studies: http://www.northumbria.ac.uk/sd/academic/ceis/re/isrc/themes/rmarea/datum/

Creative arts: http://www.projectcairo.org/

Archaeology, social anthropology: http://www.lib.cam.ac.uk/preservation/datatrain/

Psychological sciences: http://www.dmtpsych.york.ac.uk/

Social sciences, geographical sciences, clinical psychology: Project http://bit.ly/RDMantra ; Online course: http://datalib.edina.ac.uk/mantra/

MANTRA Training Materials, University of Edinburgh

Online course built using the open-source Xerte toolkit.

Sections include:

– DMPs

– Organising Data

– File Formats and Transformation

– Documentation and Metadata

– Storage and Security

– Data Protection

– Preservation, sharing and licensing

Also software practicals for users of SPSS, R, ArcGIS, NVivo

Research Data MANTRA:

http://datalib.edina.ac.uk/mantra/

New JISCMRD Training Projects

Sheffield: training for LIS PGs and subject/liaison librarians.

UEL: reuse and adaptation of psychology materials, new materials for computer science; training for library support staff.

QMUL: training for digital music researchers.

Herts: training for researchers in physics and astronomy.

Appraisal and selection

1. Relevance to mission

2. Scientific or historical value

3. Uniqueness

4. Potential for redistribution

5. Non-replicability

6. Economic case

7. Full documentation

Angus Whyte (DCC) and Andrew Wilson (ANDS), How to Appraise and Select Research Data for Curation

http://www.dcc.ac.uk/node/9098
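The seven criteria above lend themselves to a simple checklist record per dataset. This is a minimal Python sketch under that assumption: the guide is not claimed to prescribe any scoring scheme, so the function (a hypothetical name) only records which criteria were judged met and which were not, leaving the retention decision to a person.

```python
from typing import Dict, List

# The seven appraisal criteria as listed on the slide.
CRITERIA: List[str] = [
    "Relevance to mission",
    "Scientific or historical value",
    "Uniqueness",
    "Potential for redistribution",
    "Non-replicability",
    "Economic case",
    "Full documentation",
]


def appraise(answers: Dict[str, bool]) -> Dict[str, List[str]]:
    """Split the criteria into those judged met and unmet for one dataset.

    Raises ValueError if any criterion has not been answered, so that an
    incomplete appraisal is never mistaken for a finished one.
    """
    missing = [c for c in CRITERIA if c not in answers]
    if missing:
        raise ValueError(f"unanswered criteria: {missing}")
    return {
        "met": [c for c in CRITERIA if answers[c]],
        "unmet": [c for c in CRITERIA if not answers[c]],
    }


if __name__ == "__main__":
    # Illustrative answers only; not drawn from any real appraisal.
    example = {c: True for c in CRITERIA}
    example["Economic case"] = False
    print(appraise(example))
```

Keeping the unmet criteria explicit, rather than collapsing them into a score, makes the reasons for a keep-or-discard decision auditable later.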

From prototype to platform…

DataFlow Project: http://www.dataflow.ox.ac.uk/

VIDaaS Project: http://vidaas.oucs.ox.ac.uk/

UMF Programme SaaS for RDM Projects: http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx

The JISC UMF DataFlow Project

DataStage file system

Researchers

DataBank repository

Researchers, other users

SWORD deposit

DataBank is a generic repository, and can be used to store things other than research datasets, for example data management plans (DMPs)

DataStage is a file management system

A DataStage data package consists of selected data files accompanied by an RDF metadata manifest, with a SWORD v2 wrapper
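The package idea described above (data files plus an RDF manifest) can be illustrated with Python's standard library. This is a sketch under stated assumptions, not DataStage's actual packaging code: the manifest file name `manifest.rdf`, the choice of Dublin Core terms, and both function names are hypothetical, and the SWORD v2 deposit step itself is not shown.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict, List

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DCT = "http://purl.org/dc/terms/"
MANIFEST_NAME = "manifest.rdf"  # hypothetical file name


def build_manifest(title: str, filenames: List[str]) -> bytes:
    """Serialise a minimal RDF/XML manifest listing the packaged files."""
    ET.register_namespace("rdf", RDF)
    ET.register_namespace("dcterms", DCT)
    root = ET.Element(f"{{{RDF}}}RDF")
    # An empty rdf:about makes the description refer to the package itself.
    desc = ET.SubElement(root, f"{{{RDF}}}Description", {f"{{{RDF}}}about": ""})
    ET.SubElement(desc, f"{{{DCT}}}title").text = title
    for name in filenames:
        ET.SubElement(desc, f"{{{DCT}}}hasPart", {f"{{{RDF}}}resource": name})
    return ET.tostring(root, encoding="utf-8")


def build_package(title: str, files: Dict[str, bytes]) -> bytes:
    """Return a zip archive holding the selected files and their manifest."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
        archive.writestr(MANIFEST_NAME, build_manifest(title, sorted(files)))
    return buffer.getvalue()


if __name__ == "__main__":
    package = build_package("Example dataset", {"results.csv": b"x,y\n1,2\n"})
    print(zipfile.ZipFile(io.BytesIO(package)).namelist())
```

A real deposit would send an archive like this to the repository's SWORD v2 endpoint; the endpoint URL and packaging identifier depend on the repository and are deliberately left out here.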
