Simon Hodson

JISC and the Big (Research) Data Challenge. Simon Hodson, JISC Programme Manager, Managing Research Data. Thursday 10 May 2012. Eduserv Symposium: Big Data.

description

I shall provide a summary of JISC work in the area of ‘Big Data’. My primary focus will be on how to manage the huge amount of research data produced in UK Universities. I shall cover the history of JISC interventions to improve research data management and look at next steps. I shall touch on some other areas of work like ‘Digging into Data’ and web archiving which also deal with ‘big data’.

Transcript of Simon Hodson

Page 1: Simon Hodson

JISC and the Big (Research) Data Challenge

Simon Hodson, JISC Programme Manager, Managing Research Data

Thursday 10 May 2012

Eduserv Symposium: Big Data

Page 2: Simon Hodson

Why is managing research data important?

JISC considers it a priority to support universities in improving the way research data is managed and, where appropriate, made available for reuse.

• Research funder policies, legislative frameworks, good practice, open data agenda

– The outputs of publicly funded research should be publicly available.

– The evidence underpinning research findings should be available for validation.

• Good data management is good for research

– More efficient research process, avoidance of data loss, benefits of data reuse

• Alignment with university missions.

– Universities want to provide excellent research infrastructure.

– Universities want to have better oversight of research outputs.

Page 3: Simon Hodson

Estimated Research Data Requirements

Two Russell Group Universities

• Estimated current data holdings of c.2PB (managed and unmanaged)

• Currently provide 800TB/300TB in a central storage facility, not all of which is used (but will be full in 12-18 months)…

• Significant amount of data in temporary storage, external drives etc…

• ‘the more groups we go to talk to, the more we're hearing of significant data holdings on external hard drives and small RAID systems’

1994 Group University

• No central research data provision.

• Faculties (medicine, business, humanities) have 20-30TB each.

• Engineering currently has a 170TB faculty system, with an urgent need to expand.

• But… one group, recently interviewed, currently has 250TB, only half in ‘managed storage’; will reach PB levels in the next few years.
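To give a feel for what ‘PB levels in the next few years’ implies, the following is a minimal sketch of a compound-growth estimate. The growth rates are hypothetical inputs chosen for illustration, not figures from the universities surveyed, and 1PB is taken as 1000TB.

```python
import math


def years_to_reach(current_tb, target_tb, annual_growth):
    """Years until holdings reach target_tb, assuming compound annual growth.

    annual_growth is a fraction (0.5 means +50% per year). It is an
    assumption supplied by the caller, not a figure from the slides.
    """
    if current_tb <= 0 or annual_growth <= 0:
        raise ValueError("current_tb and annual_growth must be positive")
    if current_tb >= target_tb:
        return 0.0
    return math.log(target_tb / current_tb) / math.log(1 + annual_growth)


# The research group quoted above: 250TB today, 1PB taken as 1000TB.
for growth in (0.5, 1.0):  # hypothetical growth rates
    print(f"{growth:.0%} per year: {years_to_reach(250, 1000, growth):.1f} years")
```

The same function can be pointed at the central storage figures to sanity-check a ‘full in 12-18 months’ estimate, provided a measured growth rate is available.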

Page 4: Simon Hodson

DUDs

The data centre under the desk (or in a back pack) is not adequate.

Page 5: Simon Hodson

Why manage research data?

• Not just about storage or avoiding data loss…!

• It’s about knowing what to keep and what to throw away…

• Important to extract maximum return on investment from publicly funded research.

• Access to underlying data is essential for verification and therefore research integrity.

• Opportunities to extract more knowledge from existing data, new analysis.

• It’s about making the most out of data created!

Page 6: Simon Hodson

Making Data Meaningful and Reusable

Page 7: Simon Hodson

JISC and Research Data

1. Understanding the problem (pre-2007-2009)

2. Prototyping solutions (2009-11)

3. Hardening solutions and building institutional capacity (2011-13)

4. Developing elements of national infrastructure (2013+)

Page 8: Simon Hodson

1: Understanding the Problem

Key JISC reports:

• Dealing with Data: http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/reports/dealing_with_data_report-final.pdf

• Keeping Research Data Safe: http://www.jisc.ac.uk/media/documents/publications/keepingresearchdatasafe0408.pdf

• Skills, Role, Career Structure of Data Scientists and Curators: http://www.jisc.ac.uk/media/documents/programmes/digitalrepositories/dataskillscareersfinalreport.pdf

Other:

• UKRDS Scoping Study:

http://www.ukrds.ac.uk/resources/

Page 9: Simon Hodson

Prototyping Solutions: First MRD Programme, 2009-11

• First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11

• JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs

RDM Infrastructure (guidance/support, systems)

RDM Planning (DMPs, best practice, disciplinary challenges)

RDM Training (targeted at disciplinary needs)

Challenges of data citation and publication

Page 10: Simon Hodson

Building Institutional Capacity: Second MRD Programme, 2011-13

• Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11

• Projects shortly to be announced for research data publication and developing RDM training materials: http://bit.ly/jiscmrd-2012-Call

RDM Infrastructure (policy, guidance/support, systems): 17 large projects

RDM Planning (DMPs, best practice, disciplinary challenges)

RDM Training (disciplines and libraries/research support)

Innovative data publication

Page 11: Simon Hodson

A holistic approach…

Leadership and Policy Development

Guidance and Training

Support for Data Management Planning

RDM Systems and Infrastructure

Publication, Citation and Discovery Mechanisms

Page 12: Simon Hodson

How to develop RDM services

In development!

Why develop services?

Roles and responsibilities

Process of service development

The components / building blocks

• Policy

• Data Management Planning

• Storage

• Data registry…

Getting started

Examples and case studies to develop into toolkit

Slide Credit: Sarah Jones and Martin Donnelly, DCC

Page 13: Simon Hodson

Next steps? Elements of a national infrastructure

• Journals are increasingly implementing policies requiring availability of underlying data.

• Registry of Journal Data Policies to help researchers and research administrators understand the implications and changing landscape.

• Universities are developing catalogues of research data holdings.

• National registry of research data to facilitate discovery, reuse; better understanding of impact and research landscape.

Page 14: Simon Hodson
Page 15: Simon Hodson

Thank You!

• First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11

• JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs

• Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11

• Programme Blog: http://researchdata.jiscinvolve.org/

• MRD Project Blogs: http://tiny.cc/MRDblogs

• Twitter: #jiscmrd

• E-mail: [email protected]

• Acknowledgements for slides, content: Carol Goble, Liz Lyon, Peter Murray-Rust, David Shotton, Martin Donnelly, Sarah Jones.

Page 16: Simon Hodson

From prototype to platform…

DataFlow Project: http://www.dataflow.ox.ac.uk/

UMF Programme SaaS for RDM Projects: http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx

Page 17: Simon Hodson

The JISC UMF DataFlow Project

DataStage file system

Researchers

DataBank repository

Researchers, other users

SWORD deposit

• DataBank is a generic repository, and can be used to store things other than research datasets, for example data management plans (DMPs)

• DataStage is a file management system

• A DataStage data package consists of selected data files accompanied by an RDF metadata manifest, with a SWORD v2 wrapper
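The package structure described above can be sketched roughly as follows. This is an illustration only, not DataStage's actual code: the function names, the `manifest.rdf` file name, and the choice of Dublin Core terms are assumptions, and the SWORD v2 part is limited to preparing the headers of a binary deposit (SimpleZip packaging, hex MD5 checksum) without sending any request.

```python
import hashlib
import io
import zipfile
from xml.sax.saxutils import escape, quoteattr

DCTERMS = "http://purl.org/dc/terms/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def build_manifest(title, creator, filenames):
    """Return a small RDF/XML manifest describing the package contents."""
    parts = "".join(
        f"    <dcterms:hasPart rdf:resource={quoteattr(name)}/>\n"
        for name in filenames
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        f'<rdf:RDF xmlns:rdf="{RDF}" xmlns:dcterms="{DCTERMS}">\n'
        '  <rdf:Description rdf:about="">\n'
        f"    <dcterms:title>{escape(title)}</dcterms:title>\n"
        f"    <dcterms:creator>{escape(creator)}</dcterms:creator>\n"
        f"{parts}"
        "  </rdf:Description>\n"
        "</rdf:RDF>\n"
    )


def build_package(files, title, creator):
    """Zip the selected files together with manifest.rdf; return the bytes.

    files maps a file name to its content (str or bytes).
    """
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, content in files.items():
            archive.writestr(name, content)
        # Hypothetical manifest name; the real DataStage layout may differ.
        archive.writestr("manifest.rdf", build_manifest(title, creator, sorted(files)))
    return buffer.getvalue()


def sword_headers(package, filename):
    """Headers for a SWORD v2 binary deposit (no request is sent here)."""
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Content-MD5": hashlib.md5(package).hexdigest(),
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
        "In-Progress": "false",
    }


package = build_package(
    {"readings.csv": "sample,value\na,1\n"},  # made-up example data
    title="Example dataset",
    creator="A. Researcher",
)
print(sword_headers(package, "example-package.zip"))
```

A deposit would then POST the package bytes with these headers to the repository's SWORD collection URL, which depends on the DataBank installation and is not shown here.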