Simon Hodson

Post on 11-Nov-2014


I shall provide a summary of JISC work in the area of ‘Big Data’. My primary focus will be on how to manage the huge amount of research data produced in UK Universities. I shall cover the history of JISC interventions to improve research data management and look at next steps. I shall touch on some other areas of work like ‘Digging into Data’ and web archiving which also deal with ‘big data’.

Transcript of Simon Hodson

JISC and the Big (Research) Data Challenge

Simon Hodson, JISC Programme Manager, Managing Research Data

Thursday 10 May 2012

Eduserv Symposium: Big Data

Why is managing research data important?

JISC considers it a priority to support universities in improving the way research data is managed and, where appropriate, made available for reuse.

• Research funder policies, legislative frameworks, good practice, open data agenda

– The outputs of publicly funded research should be publicly available.

– The evidence underpinning research findings should be available for validation.

• Good data management is good for research

– More efficient research process, avoidance of data loss, benefits of data reuse

• Alignment with university missions.

– Universities want to provide excellent research infrastructure.

– Universities want to have better oversight of research outputs.

Estimated Research Data Requirements

Two Russell Group Universities

• Estimated current data holdings of c.2PB (managed and unmanaged)

• Currently provide 800TB/300TB in a central storage facility, not all of which is used (but will be full in 12-18 months)…

• Significant amount of data in temporary storage, external drives etc…

• ‘the more groups we go to talk to, the more we're hearing of significant data holdings on external hard drives and small RAID systems’

1994 Group University

• No central research data provision.

• Faculties (medicine, business, humanities) have 20-30TB each.

• Engineering currently has a 170TB faculty system, with an urgent need to expand.

• But… one group, recently interviewed, currently has 250TB, only half in ‘managed storage’; will reach PB levels in the next few years.
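As rough arithmetic on the last figure: how soon 250TB reaches petabyte scale depends on a growth rate that the slide does not state. The sketch below assumes steady compound growth and 1PB = 1000TB; the function name and the rates tried are hypothetical, chosen only for illustration.

```python
import math


def years_to_reach(current_tb: float, target_tb: float, annual_growth: float) -> float:
    """Years for holdings to grow from current_tb to target_tb.

    Assumes a fixed compound annual growth rate (0.5 means +50% per year).
    This is an illustrative model, not a figure from the talk.
    """
    if current_tb <= 0 or target_tb <= 0 or annual_growth <= 0:
        raise ValueError("sizes and growth rate must be positive")
    if target_tb <= current_tb:
        return 0.0
    # Solve current * (1 + r)^t = target for t.
    return math.log(target_tb / current_tb) / math.log(1 + annual_growth)


if __name__ == "__main__":
    # Hypothetical growth rates; the slide only says "the next few years".
    for rate in (0.25, 0.50, 1.00):
        years = years_to_reach(250, 1000, rate)
        print(f"{rate:.0%} growth per year: about {years:.1f} years to reach 1PB")
```

Under these assumptions, holdings that double every year would take two years to get from 250TB to 1PB; slower growth stretches that out accordingly.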

DUDs

The data centre under the desk (or in a backpack) is not adequate.

Why manage research data?

• Not just about storage or avoiding data loss…!

• It’s about knowing what to keep and what to throw away…

• Important to extract maximum return on investment from publicly funded research.

• Access to underlying data is essential for verification and therefore research integrity.

• Opportunities to extract more knowledge from existing data, new analysis.

• It’s about making the most out of data created!

Making Data Meaningful and Reusable

JISC and Research Data

1. Understanding the problem (pre-2007 to 2009)

2. Prototyping solutions (2009-11)

3. Hardening solutions and building institutional capacity (2011-13)

4. Developing elements of national infrastructure (2013+)

1: Understanding the Problem

Key JISC reports:

• Dealing with Data: http://www.ukoln.ac.uk/ukoln/staff/e.j.lyon/reports/dealing_with_data_report-final.pdf

• Keeping Research Data Safe: http://www.jisc.ac.uk/media/documents/publications/keepingresearchdatasafe0408.pdf

• Skills, Role, Career Structure of Data Scientists and Curators: http://www.jisc.ac.uk/media/documents/programmes/digitalrepositories/dataskillscareersfinalreport.pdf

Other:

• UKRDS Scoping Study: http://www.ukrds.ac.uk/resources/

Prototyping Solutions: First MRD Programme, 2009-11

• First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11

• JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs

RDM Infrastructure (guidance/support, systems)

RDM Planning (DMPs, best practice, disciplinary challenges)

RDM Training (targeted at disciplinary needs)

Challenges of data citation and publication

Building Institutional Capacity: Second MRD Programme, 2011-13

• Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11

• Projects shortly to be announced for research data publication and developing RDM training materials: http://bit.ly/jiscmrd-2012-Call

RDM Infrastructure (policy, guidance/support, systems): 17 large projects

RDM Planning (DMPs, best practice, disciplinary challenges)

RDM Training (disciplines and libraries/research support)

Innovative data publication

A holistic approach…

Leadership and Policy Development

Guidance and Training

Support for Data Management Planning

RDM Systems and Infrastructure

Publication, Citation and Discovery Mechanisms

How to develop RDM services

In development!

Why develop services?

Roles and responsibilities

Process of service development

The components / building blocks

• Policy

• Data Management Planning

• Storage

• Data registry…

Getting started

Examples and case studies to develop into toolkit

Slide Credit: Sarah Jones and Martin Donnelly, DCC

Next steps? Elements of a national infrastructure

• Journals are increasingly implementing policies requiring availability of underlying data.

• Registry of Journal Data Policies to help researchers and research administrators understand the implications and changing landscape.

• Universities are developing catalogues of research data holdings.

• National registry of research data to facilitate discovery, reuse; better understanding of impact and research landscape.

Thank You!

• First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11

• JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs

• Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11

• Programme Blog: http://researchdata.jiscinvolve.org/

• MRD Project Blogs: http://tiny.cc/MRDblogs

• Twitter: #jiscmrd

• E-mail: s.hodson@jisc.ac.uk

• Acknowledgements for slides, content: Carol Goble, Liz Lyon, Peter Murray-Rust, David Shotton, Martin Donnelly, Sarah Jones.

From prototype to platform…

DataFlow Project: http://www.dataflow.ox.ac.uk/

UMF Programme SaaS for RDM Projects: http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx

The JISC UMF DataFlow Project

DataStage file system

Researchers

DataBank repository

Researchers, other users

SWORD deposit

• DataBank is a generic repository, and can be used to store things other than research datasets, for example data management plans (DMPs)

• DataStage is a file management system

• A DataStage data package consists of selected data files accompanied by an RDF metadata manifest, with a SWORD v2 wrapper
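To make the shape of such a package concrete, here is a minimal sketch: selected files zipped together with an RDF/XML manifest, plus the HTTP headers a SWORD v2 binary deposit would typically carry. It is not DataStage's actual format. The manifest file name, the Dublin Core terms used, the function names and the choice of the SimpleZip packaging identifier are all assumptions made for illustration, and nothing is sent over the network.

```python
import io
import zipfile
from xml.sax.saxutils import escape, quoteattr


def build_manifest(package_uri: str, title: str, creator: str, filenames: list[str]) -> str:
    """Return a small RDF/XML manifest describing the package.

    Uses Dublin Core terms as an assumed vocabulary; the real DataStage
    manifest may use different properties.
    """
    parts = "\n".join(
        f"    <dcterms:hasPart rdf:resource={quoteattr(name)}/>" for name in filenames
    )
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"\n'
        '         xmlns:dcterms="http://purl.org/dc/terms/">\n'
        f"  <rdf:Description rdf:about={quoteattr(package_uri)}>\n"
        f"    <dcterms:title>{escape(title)}</dcterms:title>\n"
        f"    <dcterms:creator>{escape(creator)}</dcterms:creator>\n"
        f"{parts}\n"
        "  </rdf:Description>\n"
        "</rdf:RDF>\n"
    )


def build_package(files: dict[str, bytes], manifest: str) -> bytes:
    """Zip the selected data files together with the manifest, in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        # "manifest.rdf" is a hypothetical file name for this sketch.
        archive.writestr("manifest.rdf", manifest)
        for name, content in files.items():
            archive.writestr(name, content)
    return buffer.getvalue()


def sword_deposit_headers(filename: str) -> dict[str, str]:
    """Headers for a SWORD v2 binary deposit of a zip file.

    Assumes the receiving repository accepts the SimpleZip packaging format.
    """
    return {
        "Content-Type": "application/zip",
        "Content-Disposition": f"attachment; filename={filename}",
        "Packaging": "http://purl.org/net/sword/package/SimpleZip",
    }


if __name__ == "__main__":
    files = {"results.csv": b"sample,value\n1,0.42\n"}
    manifest = build_manifest(
        "http://example.org/packages/demo",  # placeholder identifier
        "Demo package",
        "A. Researcher",
        list(files),
    )
    package = build_package(files, manifest)
    print(len(package), "bytes;", sword_deposit_headers("demo.zip"))
```

A client would POST the resulting bytes with these headers to a repository collection URI; the endpoint and credentials depend on the DataBank instance and are deliberately left out.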