Data and Donuts: How to write a data management plan

Post on 09-Jan-2017


How to write a data management plan

C. Tobin Magle, PhD

Sept. 29, 2016

10:00-11:00 a.m.

Morgan Library Computer Classroom 173

*inspired by content from CU Boulder research computing

What is research data?

• “The recorded factual material commonly accepted in the scientific community as necessary to validate research findings”

- White House Office of Management and Budget

• Reality: anything that is a (digital) product of your research

What is a data management plan?

A description of how you plan to describe, preserve and share your research data.

Often required by funding agencies

DMPTool

• Review requirements from different agencies

• https://dmptool.org/guidance

• Create new DMPs based on funding agency templates

• Search public DMPs

Successful DMPs include

• A data inventory, including type(s) and size

• A strategy for describing the data

• A plan for preserving the data long term

• A method for access to the data

Always make sure to follow funder requirements

Data inventory

• What type of data are you going to collect?

• What file type will be produced?

• What size will these files be? How many files?

• What other research outputs will be produced?
  • Code/Software?
  • Templates/protocols?

Data inventory: example

• What type of data are you going to collect? miRNA sequences

• What file type will be produced? FASTQ files

• What size will these files be? How many files? 1 GB per file x 64 strains x 3 replicates = ~200 GB

• What other research outputs will be produced? R scripts for analysis and visualization; data use tutorials
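The size estimate in the example is simple multiplication, and it can be worth scripting so the number is easy to redo when the study design changes. The sketch below is only an illustration: the function name `estimate_storage_gb` is hypothetical, and it assumes one FASTQ file per strain per replicate, as in the example.

```python
def estimate_storage_gb(gb_per_file: float, n_strains: int, n_replicates: int) -> float:
    """Estimate total raw-data size in GB, assuming one file per strain per replicate."""
    return gb_per_file * n_strains * n_replicates


# Numbers from the example inventory: 1 GB per file, 64 strains, 3 replicates.
n_files = 64 * 3
total_gb = estimate_storage_gb(1.0, 64, 3)
print(f"{n_files} files, about {total_gb:.0f} GB of raw data")
```

The exact product is 192 GB, which the example rounds up to ~200 GB. Rounding up is a reasonable habit for a DMP, since analysis outputs and backup copies add to the total.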

Data formats

• Avoid proprietary formats

• Know what software can read your data

Proprietary format → Alternative format

• Excel (.xls, .xlsx) → Comma Separated Values (.csv)

• Word (.doc, .docx) → plain text (.txt)

• PowerPoint (.ppt, .pptx) → PDF/A (.pdf)

• Photoshop (.psd) → TIFF (.tif, .tiff)

• Quicktime (.mov) → MPEG-4 (.mp4)

• MPEG 4 Protected audio (.m4p) → MP3 (.mp3)
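One way to apply the format table above is to scan a list of project files and flag anything stored in a proprietary format. This is a minimal sketch, not a complete checker: the mapping only covers the formats listed above, and the names `ALTERNATIVES` and `suggest_alternatives` are made up for this illustration.

```python
from pathlib import Path

# Proprietary extension -> suggested alternative, taken from the table above.
ALTERNATIVES = {
    ".xls": ".csv", ".xlsx": ".csv",
    ".doc": ".txt", ".docx": ".txt",
    ".ppt": ".pdf", ".pptx": ".pdf",
    ".psd": ".tif",
    ".mov": ".mp4",
    ".m4p": ".mp3",
}


def suggest_alternatives(filenames):
    """Return {filename: suggested extension} for files in a listed proprietary format."""
    suggestions = {}
    for name in filenames:
        extension = Path(name).suffix.lower()  # case-insensitive match
        if extension in ALTERNATIVES:
            suggestions[name] = ALTERNATIVES[extension]
    return suggestions


# Hypothetical file names, for illustration only.
for name, ext in suggest_alternatives(["counts.xlsx", "reads.fastq", "notes.DOCX"]).items():
    print(f"{name}: consider also saving as {ext}")
```

Files that are already in an open format (here, the FASTQ file) are simply left out of the result.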

Exercise: Data inventory

What kind of data are you going to collect?

What file type will be produced?

What size will these files be? How many files?

What other research outputs will be produced?

A strategy for describing the data

• Metadata: Relevant information for re-creation and re-use

  • Contact info
  • How data was collected
  • Details about collection
  • Date, location of collection
  • Units

• Can be as simple as a text file

Genomics example (README)

This project contains next-generation miRNA sequencing data from 64 mouse strains.

Brain tissue from 10-week-old male mice was harvested and stored in RNAlater. RNA was extracted using an RNeasy kit, and miRNA libraries were produced using an Illumina kit. The libraries were run on an Illumina MiSeq sequencer. The FASTQ files produced were analyzed in R using Bioconductor.

The data and descriptive metadata will be made available on NCBI in the BioProject (PRJXXXX). The scripts used to analyze the data are available on GitHub (URL). Tutorials for data use will be made available in the Digital Collections of Colorado (handle).

Contact Tobin Magle (tobin.magle@colostate.edu) for more information. http://orcid.org/0000-0003-3185-7034
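Since a README can be as simple as a text file, one option is to generate it from a short list of fields so that nothing gets left out. The sketch below is an assumption-laden illustration: the field names in `REQUIRED_FIELDS` are one possible reading of the metadata list (contact info, collection method, date, location, units), and `render_readme` is a hypothetical helper, not part of any tool mentioned here.

```python
# Hypothetical field names based on the metadata checklist; adjust to your project.
REQUIRED_FIELDS = [
    "contact",
    "collection_method",
    "collection_date",
    "collection_location",
    "units",
]


def render_readme(title, metadata):
    """Build plain-text README content, refusing to proceed if a field is missing."""
    missing = [field for field in REQUIRED_FIELDS if not metadata.get(field)]
    if missing:
        raise ValueError("README is missing fields: " + ", ".join(missing))
    lines = [title, ""]
    for field in REQUIRED_FIELDS:
        lines.append(f"{field}: {metadata[field]}")
    return "\n".join(lines) + "\n"


# Placeholder values for illustration; they are not real project details.
example = {
    "contact": "name and email of the data contact",
    "collection_method": "how the samples were collected and processed",
    "collection_date": "YYYY-MM-DD",
    "collection_location": "where the data were collected",
    "units": "units for each measured variable",
}
print(render_readme("Example project README", example))
```

Failing loudly on a missing field is a deliberate choice: it is easier to fill in a date at collection time than to reconstruct it years later.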

Metadata standards

• Dublin Core: http://dublincore.org/documents/dcmi-terms/
  • Can be applied to anything

• Many discipline-specific metadata standards
  • EML: https://knb.ecoinformatics.org/#external//emlparser/docs/index.html
  • MIAME: http://fged.org/projects/miame/

• Search for other standards:
  • http://www.dcc.ac.uk/resources/metadata-standards
  • https://biosharing.org/standards/
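To give a feel for what a Dublin Core description looks like in XML, here is a small sketch using Python's standard library. It uses the classic Dublin Core element namespace (`http://purl.org/dc/elements/1.1/`) with elements such as `title`, `creator`, `date`, `format` and `identifier`. The wrapping `metadata` element, the function name and the title text are assumptions made for illustration; a real repository will specify its own wrapper and required elements.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace
ET.register_namespace("dc", DC_NS)


def dublin_core_record(fields):
    """Serialize {element name: value} as Dublin Core elements inside a wrapper element."""
    root = ET.Element("metadata")  # wrapper name is an assumption, not part of the standard
    for name, value in fields.items():
        element = ET.SubElement(root, f"{{{DC_NS}}}{name}")
        element.text = value
    return ET.tostring(root, encoding="unicode")


# Illustrative values loosely based on the genomics README; the title is made up
# and the identifier is left as a placeholder.
record = dublin_core_record({
    "title": "miRNA sequencing of 64 mouse strains",
    "creator": "C. Tobin Magle",
    "format": "FASTQ",
    "identifier": "PRJXXXX",
})
print(record)
```

The same dictionary of fields could just as well be written out as a text file or a table; XML is only one of the description formats to consider.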

Genomics example (NCBI template)

Exercise: Describe your data

What do people need to know to reuse your data?

Are there any discipline-specific metadata standards?

What format will you describe your data in (text, XML, tabular)?

What fields will you include (author, date, format, identifier)?

A plan for preserving the data long term

• What will you do to ensure data are properly stored and preserved?

• Include metadata and other products needed for reuse

• Might change over course of the project

Preservation questions

• What will you store?

• Who will be in charge?

• How long will you store it?

• Where will you store it?
  • Multiple copies

Recommendations for backing up data

• Store in geographically distinct locations

• Automation: Will you remember to do it manually?

• Security: Are you working with PHI?
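The automation point can be illustrated with a short script that copies a file to several backup locations and verifies each copy with a checksum. This is a sketch under stated assumptions, not a full backup system: the function names are hypothetical, the backup locations are plain directories (in practice they would be mounts for geographically distinct storage), and it does nothing to address the security requirements of PHI.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks to handle large files."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dirs):
    """Copy source into each backup directory and verify every copy by checksum."""
    source = Path(source)
    expected = sha256_of(source)
    copies = []
    for directory in backup_dirs:
        directory = Path(directory)
        directory.mkdir(parents=True, exist_ok=True)
        target = directory / source.name
        shutil.copy2(source, target)  # copy2 also preserves file timestamps
        if sha256_of(target) != expected:
            raise OSError(f"Checksum mismatch for {target}")
        copies.append(target)
    return copies
```

A call such as `backup_file("sample.fastq", ["/mnt/site_a", "/mnt/site_b"])` (paths are placeholders) could then be run by a scheduler such as cron, so the backup does not depend on anyone remembering to do it.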

Exercise: Preservation plan

What will you store?

Who will be responsible for the data (person or position)?

How long will you store it?

Where will you store it?

How will you back it up?

A method to access the data

• Important to funding agencies
  • Reproduce existing research
  • Promote further research

• Must be easily available:
  • No “by request only”
  • Embargoes are “ok”

• Data security: consider privacy and IP issues before sharing

Data access and sharing best practices

• Non-proprietary formats

• Include metadata

• Proper storage

• Stable identifier

• Licensing: conditions for reuse

Trusted Repositories: store and share

• Discipline-specific repositories
  • Search: http://service.re3data.org/browse/by-subject/

• Generic:
  • Figshare - https://figshare.com/
  • Dryad - http://datadryad.org/

• CSU Digital Repository:
  • http://lib.colostate.edu/digital-collections/

Stable identifiers

• URLs break

• Stable identifiers are permanent in a database

• Some provide linking capabilities
  • DOI – https://doi.org/10.1109/5.771073
  • Handle – http://hdl.handle.net/10217/177356
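The two examples above share a pattern: the identifier itself (for instance `10.1109/5.771073`) stays fixed, and a resolver service turns it into a link to wherever the item currently lives. The helper names below are hypothetical; the resolver addresses are the ones shown in the examples.

```python
def doi_url(doi: str) -> str:
    """Build a resolvable link for a DOI using the doi.org resolver."""
    return f"https://doi.org/{doi}"


def handle_url(handle: str) -> str:
    """Build a resolvable link for a Handle using the hdl.handle.net resolver."""
    return f"http://hdl.handle.net/{handle}"


print(doi_url("10.1109/5.771073"))
print(handle_url("10217/177356"))
```

When citing data, it is the identifier-based link that should be recorded, since the landing page behind it can move without breaking the citation.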

Licensing

• State your conditions for reuse
  • Paper citation?

• Disclaimers

• Must justify limitations, describe how you’ll advertise them

• Creative Commons licenses are a good starting point

Exercise: Access methods

Where will people be able to access the data?

Does your discipline have a repository?

What kind of stable identifier will it have?

What are the conditions for reuse?

Are there any limitations to use of these data? Why?

Need help?

• Email: tobin.magle@colostate.edu

• DMPTool: http://dmptool.org/

• Data Management Services website: http://lib.colostate.edu/services/data-management

• Being updated