Data Management Best Practices

21
Data Management Best Practices Ryan Womack and Aletia Morgan Rutgers University Libraries Research Data Services Data Management Best Practices Ryan Womack and Aletia Morgan Rutgers University Libraries Research Data Services October 29, 2013

description

Data Management Best Practices. Data Management Best Practices. Ryan Womack and Aletia Morgan Rutgers University Libraries Research Data Services. Ryan Womack and Aletia Morgan Rutgers University Libraries Research Data Services. October 29, 2013. Why Data Management?. - PowerPoint PPT Presentation

Transcript of Data Management Best Practices

Page 1: Data Management Best Practices

Data Management Best Practices

Ryan Womack and Aletia MorganRutgers University Libraries

Research Data Services

Data Management Best Practices

Ryan Womack and Aletia MorganRutgers University Libraries

Research Data Services

October 29, 2013

Page 3: Data Management Best Practices

Why Data Management?• Best Case Scenarios, Data is…– Robust– Recoverable– Reliable– Reusable– Reproducible– Reputable– Renowned

Page 4: Data Management Best Practices

The Data Lifecycle

Page 5: Data Management Best Practices

Understand your data• Quantity (MB, GB, TB)

– will affect your decisions about how to store, package, transport, and backup your data– Large numbers of discrete files, regardless of size, may be harder to handle– Multiple versions? Frequent updates?

• Format– Seek out open formats or at least commonly used formats (.xlsx is ok, consider .csv)– Both for preservation and reuse, dependency on a single option can be a problem, even

if it is the most convenient in the short-term.

• Rights and permissions– Reuse of other data sources (licensed or otherwise) may require investigation and

restrict data sharing options– Confidentiality of human subjects or guarding of information for patent protection may

be reasons to restrict data– Give appropriate credit to others’ contributions

Page 6: Data Management Best Practices

Preserving your Data

What types of data need to be managed or and stored over the course of your project?

• Raw Data

• Working Data

• Processed or Final Data

• Preserved Data for Possible Re-Use

Page 7: Data Management Best Practices

Data Preservation Considerations• Access Control

• Versioning

• Backup– Local– Shared• Campus• Offsite

Page 8: Data Management Best Practices

Security

Security includes both

• Physical Security

• Logical/Network/Password Security

Page 9: Data Management Best Practices

Organizing your Data• Who controls the data for which parts of the project?

• A logical structure and plan will help during and after data creation, for directories and filenames (e.g. project directory has standard locations for raw data, analysis, code, graphs, etc.)

• Project folders with well-documented naming conventions (e.g., date of data creation as part of file name in a structured format (ISO is yyyy-mm-dd). See ARM example.

• Versioning scheme should be agreed on and documented.

• Goal is to have unique and understandable identifiers so that different parts of a project can merge and be handled easily over time, even if participants and conditions change.

Page 10: Data Management Best Practices

Documenting your Data• Documentation helps

– with your own reuse of the data at a later date, – others in your research group working with the data– with the long-term reuse potential of your data

• A Codebook is a structured way of explaining the contents of your data file

• Readme files are a well-understood way of communicating information about the contents and setup of your data

• Documentation can also take the form of a simple text or Word document

• Include any explanations of experimental methods, software code run on the data, and any other tools needed to work with the data

• Your previous work on creating and describing a consistent organization for your data will help here.

Page 11: Data Management Best Practices

Reproducible Research• Ideally, someone else can grab your data project as a complete

bundle of data, documentation, and software code, and recreate the analysis to get exactly the same results

• Reports and data can be integrated so that live analysis run on actual data can be placed in reports (some R packages do this).

• Many initiatives are advancing the concept of reproducible research.

• This high standard of evidence and validation is an assurance that data and conclusions are not flawed (or faked).

• Good data management practices lay the groundwork for success in reproducible research

Page 12: Data Management Best Practices

Data Management Plans• Many sponsored research agencies now require a "Data

Management Plan" (DMP) as a component of any proposal

• The DMP is a formal document that outlines what the PI will do with data during and after completion of a funded research project

• The specific requirements for the DMP vary by funder, and by research subject

Page 13: Data Management Best Practices

Why The DMP Requirement?• Ensure the preservation of important research data

• Support the potential re-use of grant funded data by other researchers to validate and potentially extend the value of the data

• Improve accountability for use of public revenue to support research

• Provide for the reasonable access to research data to be consistent with the Freedom of Information Act

Page 14: Data Management Best Practices

Benefits of a Good DMP• Improved competitiveness for grant programs; a clear and

complete Data Management Plan will support the project’s evaluation plan

• Enhanced PI efficiency – writing a DMP encourages the creation of a structured plan for managing research data throughout the life of the project and beyond

• Long-term protection and preservation of RU research data

Page 15: Data Management Best Practices

Who’s Asking for a DMP?• National Science Foundation http://www.nsf.gov/pubs/policydocs/pappguide/nsf11001/aag_6.jsp#VID4

• Centers for Disease Control and Prevention http://www.cdc.gov/od/foia/policies/sharing.htm• Department of Energy http://www.cio.energy.gov/policy-guidance/federal_regulations.htm• Department of Defense http://www.dtic.mil/whs/directives/corres/pdf/320014p.pdf• Environmental Protection Agency http://www.epa.gov/quality/informationguidelines/documents/EPA_InfoQualityGuidelines.pdf• Institute of Museum and Library Services http://www.imls.gov/applicants/forms/DigitalProducts.pdf• NASA http://nasascience.nasa.gov/earth-science/earth-science-data-centers/data-and-information-policy• National Endowment for the Humanities http://www.neh.gov/grants/guidelines/pdf/DataManagementPlans.pdf• National Institute of Justice http://www.ojp.usdoj.gov/nij/funding/data-resources-program/welcome.htm• National Institute of Standards and Technology http://www.nist.gov/director/quality_standards.htm• United States Department of Agriculture http://www.csrees.usda.gov/• United State Department of Education http://ies.ed.gov/funding/datasharing_policy.asp

• Many academic journals have also begun to require data sharing as part of the submission process. For a partial list of journals with data sharing mandates for their published articles, see the following:

• http://oad.simmons.edu/oadwiki/Journal_open-data_policies http://gking.harvard.edu/pages/data-sharing-and-replication

As described by Portland State University Libraryhttp://library.pdx.edu/digital-scholarship/data-mgmt-plans/who-requires-dmps.html

Page 16: Data Management Best Practices

The NSF DMP Should Include• [data attributes] The types of data, samples, physical collections, software,

curriculum materials, and other materials to be produced in the course of the project;

• [metadata] The standards to be used for data and metadata format and content

• [security policies] Policies for access and sharing including provisions for appropriate protection of privacy, confidentiality, security, intellectual property, or other rights or requirements, including the right to embargo data for a specified time period to allow first publication and thorough use of the data;

• [use policies] Policies and provisions for fair re-use, re-distribution, and the production of derivatives; and

• [preservation] Plans for archiving data, samples, and other research products, and for preservation of access to them.

Page 17: Data Management Best Practices

DMP Web SupportOperated by the California Digital Library, the DMPTool is a site supporting general DMP development with some school-specific guidance

RUL Data Resources – offers information about services and experts to support your data management efforts

Page 18: Data Management Best Practices

Discoverable Data• Publicly available archives such as Dryad, Dataverse, ICPSR, and

more allow other researchers to easily discover and reuse data• Metadata provide standardized terminology for searching and

discovering data matching defined characteristics. Unlike a data upload to a website, metadata provides a well-structured way for computer indexing of the data.

• Your discipline may have well-defined metadata standards (or not). Check with your librarian if you are in doubt.

• Databib and re3data are directories of research data repositories that can be used to discover possible locations for you to deposit and share your data.

Page 19: Data Management Best Practices

Citing Data• By publicly sharing your data via an established

repository, you will typically get a DOI (Digital Object Identifier) or other persistent URL. This will serve as a permanent pointer to the data

• You can cite your data, and others’ data, with DOI’s. This is more precise and easier for others to work with than a vague reference to the data by title, author, or even citing a paper that uses the data.

• Data citation increases the impact of your research! This completes the data lifecycle.

Page 20: Data Management Best Practices

The Data Lifecycle

Page 21: Data Management Best Practices

Further References• Australian National Data Service: Data

Management for Researchers• Australian National University: Data Management• CIESIN: Geospatial Electronic Records• ICPSR Guide to Social Science Data Preparation and Archiving

(pdf):• Oak Ridge National Laboratory: Best Practices for Preparing E

nvironmental Data Sets to Share and Archive

• UK Data Archive: Create & Manage Data and Managing and Sharing Data: a Best Practice Guide for Researchers (pdf).