Research Data Management: the institutional and
national challenges
Simon Hodson
JISC Programme Manager, Managing Research Data
Wednesday 23 May 2012
Approaches to Research Data Management, UWE, Bristol
Deluges and inundations…
A Surfboard for Riding the Wave, a four
country action programme on research data
http://bit.ly/KE_Surfboard
Hey, Trefethen, ‘The Data Deluge: an e-Science Perspective’ (2003):
http://eprints.soton.ac.uk/257648/
Riding the Wave:
http://cordis.europa.eu/fp7/ict/e-infrastructure/docs/hlg-sdi-report.pdf
PDB
GenBank
UniProt
Pfam
Spreadsheets, Notebooks
Local, Lost
High throughput experimental methods
Industrial scale
Commons based production
Publicly available data sets
Preserved
CATH, SCOP
(Protein Structure
Classification)
ChemSpider
Slide Credit: Carole Goble, Liz Lyon
Volume: the long tail…
Estimated Research Data Requirements
Two Russell Group Universities
Estimated current data holdings of c.2PB (managed and unmanaged)
Currently provide 800TB/300TB in a central storage facility, not all of which is
used (but will be full in 12-18 months)…
Significant amount of data in temporary storage, external drives etc…
‘the more groups we go to talk to, the more we're hearing of significant
data holdings on external hard drives and small RAID systems’
1994 Group University
No central research data provision.
Faculties (medicine, business, humanities) have 20-30TB each.
Engineering currently has 170TB faculty system, urgent need to expand.
But… one group, recently interviewed, currently has 250TB, only half in
‘managed storage’; will reach PB levels in the next few years.
Evidence that significant data loss occurs…
‘Departments typically don’t have guidelines or norms for personal
back-up and researcher procedure, knowledge and diligence varies
tremendously. Many have experienced moderate to catastrophic
data loss.’
– Incremental Project Scoping Study and Implementation Plan
http://www.lib.cam.ac.uk/preservation/incremental/documents/Incremental_Scoping_Report_170910.pdf
‘The current environment is such that responsibility for good data
management is devolved to individual researchers and in practice PIs
set the 'rules' and establish the cultural practices of the research
groups and this means there is good data management practice
going on in pockets but no consistency across groups. There is also
consequently a high risk of data losses by a number of means’.
– MaDAM Project Requirements Analysis
http://www.merc.ac.uk/sites/default/files/MaDAM_Requirements%20_%20gap%20analysis-v1.4-FINAL.pdf
Can we quantify the benefits
of reducing data loss?
JISCMRD Project Survey (interim analysis)
262 respondents.
23.3% of respondents (61) have lost research data
– One respondent had lost all their research data as it had not been backed
up.
– 20 had lost one week’s work
– 23 had lost one day’s work
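The survey figures above can be rechecked with a few lines of arithmetic. This is a minimal sketch that uses only the counts reported on the slide. The variable names are illustrative, and the one-week and one-day shares are derived here rather than quoted from the survey.

```python
# Counts as reported in the interim analysis above.
respondents = 262
lost_data = 61
lost_one_week = 20
lost_one_day = 23


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage, rounded to one decimal place."""
    return round(100 * part / whole, 1)


loss_rate = share(lost_data, respondents)      # share of all respondents
week_share = share(lost_one_week, lost_data)   # share of those who lost data
day_share = share(lost_one_day, lost_data)     # share of those who lost data

print(f"{loss_rate}% of respondents reported losing research data")
print(f"{week_share}% of those lost one week's work")
print(f"{day_share}% of those lost one day's work")
```

The first figure reproduces the 23.3% quoted on the slide. The other two express the one-week and one-day groups as shares of the 61 respondents who lost data.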
Why manage research data?
Not just about storage or avoiding data loss…!
It’s about knowing what to keep and what to throw away…
Important to extract maximum return on investment from publicly
funded research.
Access to underlying data is essential for verification and therefore
research integrity.
Opportunities to extract more knowledge from existing data, new
analysis: new research questions, data integration, meta studies.
It’s about making the most out of data created!
Finally, a lamentable element of the culture in social psychology and psychology research is for everyone to keep their own data and not make them available to a public archive. This is a problem on a much larger scale, as has recently become apparent. Even where a journal demands data accessibility, authors usually do not comply (Wicherts et al. 2006). Archiving and public access to research data not only makes this kind of data fabrication more visible, it is also a condition for worthwhile replication and meta-analysis.
Recommendation
Far more than is customary in psychology research practice, research replication must be made part of the basic instruments of the discipline. Research data that underlie psychology publications must be held on file for at least five years after publication, and be made available on request to other scientific practitioners. This rule is to apply not only to raw laboratory data, but also to completed questionnaires, audio and video recordings, etc. The publication must state where the raw data reside and how to access them.
– INTERIM REPORT REGARDING THE BREACH OF SCIENTIFIC INTEGRITY COMMITTED BY PROF. D.A. STAPEL, Tilburg, 31 October 2011
Benefits of data management and sharing
Papers based upon reuse of archived observations now exceed those
based on the use described in the original proposal.
– http://archive.stsci.edu/hst/bibliography/pubstat.html
Research Data Challenges
Challenges: the ‘data deluge’… huge quantities of digital data
– But it’s not just about addressing storage issues.
Opportunities: data reuse, meta-studies, interdisciplinary grand
challenges.
– Increasing awareness of research data as an asset.
– Digital research data has reuse value - important to obtain full return on
public investment.
Results in policy drivers from funders.
– Need improved knowledge of how best to realise these policies.
Increasing emphasis on the role of universities and research
institutions to provide infrastructure and support for RDM.
Drivers: Research Funder Policies
Legislative responsibilities and good practice: FoI, UK Research Integrity Office.
Most funders require applicants to submit data management and sharing plans at grant proposal stage.
– ESRC require a plan to be submitted electronically with the grant
– NERC will require DMP and introducing notion of a ‘data value checklist’
EPSRC places responsibility on institutions to develop a data policy, supporting services and roadmap
Increasing responsibility being placed on universities; policies increasingly prescriptive.
– Summary of UK Funders’ Data Policies: http://www.dcc.ac.uk/resources/policy-and-legal/overview-funders-data-policies
– Sarah Jones, ‘Developments in Research Funder Data Policy’, International Journal of Digital Curation (2012), 7(1), 114–125; http://dx.doi.org/10.2218/ijdc.v7i1.219
EPSRC Research Data Policy Expectations
Research organisations to have RDM policy, advocacy and
support functions. (i, iii)
Research data to be effectively managed and curated throughout the
life-cycle (viii)
Research organisations to maintain public catalogue of research
data holdings, adequate metadata and permanent identifier (v)
Publications to indicate how research data can be accessed (ii)
Data to be retained for 10 years from last access (vii)
Research data management to be adequately resourced from
appropriate funding streams (ix)
Roadmap in place by 1 May 2012
Compliance by 1 May 2015
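The retention expectation (vii) lends itself to a small date calculation. The sketch below follows the slide's wording, "10 years from last access", which is a simplification of the full policy text. The function names and the handling of 29 February are assumptions made for illustration.

```python
from datetime import date

RETENTION_YEARS = 10  # from expectation (vii) as summarised on the slide


def retention_end(last_access: date, years: int = RETENTION_YEARS) -> date:
    """Return the date on which the retention period has elapsed."""
    try:
        return last_access.replace(year=last_access.year + years)
    except ValueError:
        # 29 February with no counterpart in the target year: fall back to
        # 28 February (an assumption; the slide does not cover this case).
        return last_access.replace(year=last_access.year + years, day=28)


def must_retain(last_access: date, today: date) -> bool:
    """True while the retention period is still running."""
    return today < retention_end(last_access)


# Each new access restarts the clock, so a dataset used in 2012 and again
# in 2019 has to be kept until 2029 under this reading.
print(retention_end(date(2012, 5, 23)))
print(retention_end(date(2019, 5, 23)))
```

Because the period runs from the last access rather than from creation, a catalogue would need to record access dates for a check like this to be possible.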
University Mission
Providing an excellent infrastructure for research is central to a university
mission
– Research data will be managed to the highest standards throughout the research
data lifecycle as part of the University’s commitment to research excellence.
Edinburgh Research Data Policy
– The University of East London recognises that good research demands good
data management in the support of academic integrity, openness and good
stewardship. It will ensure that research data is managed to high standards
throughout the research data lifecycle as part of its commitment to academic
excellence. This policy will ensure UEL is in accordance with Research Councils
UK’s Common Principles on Data Policy as well as the specific requirements of the
Engineering and Physical Sciences Research Council Research Data Management
Policy.
Universities want to have better oversight of research outputs; data, like
publications, are a reputational asset.
– ‘Sharing research data is an important contributor to the impact of publicly funded
research.’ EPSRC Research Data Policy
Research Data Management: University of Edinburgh Roadmap
Research Integrity, London - Sept 2011
LEVEL
PhD student
university research team
individual researcher
supra-university
Where do I safely keep my data from my fieldwork, as I travel home?
How can I best keep years’ worth of research data secure and accessible for when I and others need to re-use it?
How do we ensure compliance with funders’ requirements for several years of open access to data?
How do we ensure we have access to our research data after some of the team have left?
How can our research collaborations share data, and make them available once complete?
Seeking win + win + win + win + win……
Cost-benefits and efficiencies
Benefits and savings through centralised institutional infrastructures.
– 37% projected saving in staff time and infrastructure costs from moving Oxford
Roman Economy Project database to centralised virtual service.
More efficient retrieval of data through more effective RDM systems.
– One-day delay cut to 5 minutes: Estimated time saving for crystallography
researchers to access results from Diamond synchrotron, by deploying digital
processing pipeline & metadata capture system.
Making the Case for RDM, DCC Briefing Paper:
http://www.dcc.ac.uk/resources/briefing-papers/making-case-rdm
Report on the Benefits from the Infrastructure Projects in the JISC Managing
Research Data Programme:
http://www.jisc.ac.uk/whatwedo/programmes/mrd/outputs/benefitsreport.aspx
Further evidence emerging from the University Modernisation Fund RDM
Projects and from the new Managing Research Data Programme.
Why is managing research data important?
JISC considers it a priority to support universities in improving the way
research data is managed and, where appropriate, made available for
reuse.
Research funder policies, legislative frameworks, good practice, open data
agenda
– The outputs of publicly funded research should be publicly available.
– The evidence underpinning research findings should be available for
validation
Good data management is good for research
– More efficient research process, avoidance of data loss, benefits of data reuse
Alignment with university missions.
– Universities want to provide excellent research infrastructure.
– Universities want to have better oversight of research outputs.
Supporting the Research Data Lifecycle
Plan
Create
Use
Appraise
Publish
Discover
Reuse
Store
Annotate
Select
Discard
Describe
Identify
Hand Over?
Access
Leadership and Policy Development
Guidance and Training
Support for Data Management Planning
RDM Systems and Infrastructure
Publication, Citation and Discovery Mechanisms
First Managing Research Data Programme, 2009-11
First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11
JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs
RDM Infrastructure (guidance/support, systems)
RDM Planning (DMPs, best practice, disciplinary challenges)
RDM Training (targeted at disciplinary needs)
Challenges of data citation and publication
Second Managing Research Data Programme, 2011-13
Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11
Grant funding call for projects encouraging research data publication and developing
RDM training materials: http://bit.ly/jiscmrd-2012-Call
RDM Infrastructure (policy, guidance/support, systems)
17 large projects
RDM Planning (DMPs, best practice, disciplinary challenges)
RDM Training (disciplines and librarians)
Innovative data publication
Institutional RDM Services
Institutional RDM Policy sets the tone, aspirations, lays out roles and
responsibilities.
Guidance and training for research staff and support staff.
Support for data management planning.
Research data management infrastructure:
– Systems, procedures and support for managing data during the project lifetime.
– Criteria for selection and retention…
– Archival / repository system for published data with research data catalogue /
metadata store.
Interoperation with institutional administrative / research management
systems.
How to develop RDM services
In development!
Why develop services?
Roles and responsibilities
Process of service development
The components / building blocks:
– Policy
– Data Management Planning
– Storage
– Data registry…
Getting started
Examples and case studies to develop into toolkit
Slide Credit: Sarah Jones and Martin Donnelly, DCC
Thank You!
First JISC MRD Programme, 2009-11: http://bit.ly/jiscmrd2009-11
JISC MRD Outputs Page: http://bit.ly/jiscmrd2009-11-outputs
Second JISC MRD Programme, 2011-13: http://bit.ly/jiscmrd2009-11
Programme Blog: http://researchdata.jiscinvolve.org/
Acknowledgements for slides, materials: Carole Goble, Liz Lyon, Peter Murray-Rust,
David Shotton, Jeff Heywood, Sarah Jones, Martin Donnelly
Leadership and policy development
What should an institutional RDM policy look like?
Institutional RDM policy ‘sets the tone’, lays out commitment, expectations,
roles and responsibilities.
– High level and aspirational?
– Business processes and responsibilities?
– Relation of RDM policies to other policies and procedures?
– Relation of policy and implementation?
DCC on institutional data policies (six published, five in draft):
– http://www.dcc.ac.uk/resources/policy-and-legal/institutional-data-policies
Recent JISCMRD / DCC Workshop:
– http://researchdata.jiscinvolve.org/wp/2012/03/27/developing-institutional-research-data-management-policies/
17 JISCMRD projects and 18 DCC institutions developing institutional RDM
policies.
National exchange of best practice through JISC and DCC.
JISCMRD Training Projects
Need for subject-focussed research data management / curation training, integrated with PG studies
Five projects to design and pilot (reusable) discipline-focussed training units for postgraduate courses:
http://www.jisc.ac.uk/whatwedo/programmes/mrd/rdmtrain.aspx
Health studies: http://www.northumbria.ac.uk/sd/academic/ceis/re/isrc/themes/rmarea/datum/
Creative arts: http://www.projectcairo.org/
Archaeology, social anthropology: http://www.lib.cam.ac.uk/preservation/datatrain/
Psychological sciences: http://www.dmtpsych.york.ac.uk/
Social sciences, geographical sciences, clinical psychology: Project http://bit.ly/RDMantra ; Online course: http://datalib.edina.ac.uk/mantra/
MANTRA Training Materials, University of Edinburgh
Online course built using the open-source Xerte toolkit.
Sections include:
– DMPs
– Organising Data
– File Formats and Transformation
– Documentation and Metadata
– Storage and Security
– Data Protection
– Preservation, sharing and licensing
Also software practicals for users of SPSS, R, ArcGIS, NVivo
Research Data MANTRA:
http://datalib.edina.ac.uk/mantra/
New JISCMRD Training Projects
Sheffield: training for LIS PGs and subject/liaison librarians.
UEL: reuse and adaptation of psychology materials, new materials
for computer science; training for library support staff.
QMUL: training for digital music researchers.
Herts: training for researchers in physics and astronomy.
Appraisal and selection
1. Relevance to mission
2. Scientific or historical value
3. Uniqueness
4. Potential for redistribution
5. Non-replicability
6. Economic case
7. Full documentation
Angus Whyte (DCC) and Andrew Wilson (ANDS), How to
Appraise and Select Research Data for Curation
http://www.dcc.ac.uk/node/9098
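The seven criteria above can be recorded as a simple checklist when a dataset is reviewed. The sketch below is only a recording aid under assumed names (`CRITERIA`, `appraise`). It applies no scoring or threshold, since the slide does not define one.

```python
# The seven appraisal criteria, in the order listed on the slide.
CRITERIA = [
    "Relevance to mission",
    "Scientific or historical value",
    "Uniqueness",
    "Potential for redistribution",
    "Non-replicability",
    "Economic case",
    "Full documentation",
]


def appraise(answers: dict[str, bool]) -> tuple[list[str], list[str]]:
    """Split the criteria into those met and those not met.

    Criteria missing from `answers` are treated as not met.
    """
    met = [c for c in CRITERIA if answers.get(c, False)]
    unmet = [c for c in CRITERIA if not answers.get(c, False)]
    return met, unmet


# Hypothetical review of one dataset.
met, unmet = appraise({
    "Relevance to mission": True,
    "Uniqueness": True,
    "Full documentation": False,
})
print("Met:", met)
print("Still to address:", unmet)
```

Listing the unmet criteria, rather than producing a single score, leaves the keep-or-discard decision with the reviewer.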
http://data.blogs.ilrt.org/2012/02/03/data-bris-architecture/
From prototype to platform…
DataFlow Project: http://www.dataflow.ox.ac.uk/
VIDaaS Project: http://vidaas.oucs.ox.ac.uk/
UMF Programme SaaS for RDM Projects: http://www.jisc.ac.uk/whatwedo/programmes/umf.aspx
The JISC UMF DataFlow Project
DataStage file system
Researchers
DataBank repository
Researchers, other users
SWORD deposit
DataBank is a generic repository, and can be used to store things other than
research datasets, for example data management plans (DMPs)
DataStage is a file management system
A DataStage data package consists of
selected data files accompanied by an
RDF metadata manifest, with a SWORD
v2 wrapper
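The idea of a data package, selected files travelling together with an RDF manifest, can be illustrated with the Python standard library. This is a generic sketch, not the DataStage implementation: the file name `manifest.ttl`, the use of Turtle with Dublin Core terms, and the function `build_package` are all assumptions, and the SWORD v2 deposit step is not shown.

```python
import io
import zipfile


def build_package(title: str, files: dict[str, bytes]) -> bytes:
    """Return an in-memory ZIP holding the data files plus an RDF manifest.

    File names are used as relative IRIs, so they are assumed to contain
    no spaces or other characters that would need escaping.
    """
    if not files:
        raise ValueError("a package needs at least one data file")
    safe_title = title.replace("\\", "\\\\").replace('"', '\\"')
    parts = ", ".join(f"<{name}>" for name in sorted(files))
    # Hypothetical manifest layout: the package itself (<>) has a title
    # and lists each data file as a part.
    manifest = (
        "@prefix dcterms: <http://purl.org/dc/terms/> .\n"
        f'<> dcterms:title "{safe_title}" ;\n'
        f"   dcterms:hasPart {parts} .\n"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("manifest.ttl", manifest)
        for name in sorted(files):
            archive.writestr(name, files[name])
    return buffer.getvalue()


package = build_package(
    "Demo dataset",
    {"readings.csv": b"t,value\n0,1.5\n", "notes.txt": b"pilot run"},
)
archive = zipfile.ZipFile(io.BytesIO(package))
print(archive.namelist())
print(archive.read("manifest.ttl").decode())
```

Keeping the manifest inside the package means the description travels with the files when they are handed from the file system to the repository.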
http://www1.uwe.ac.uk/library/usingthelibrary/servicesforresearchers/datamanagement/managingresearchdata/projectoutputs/workpackages1and2.aspx