Conquering Chaos in the Age of Networked Science: Research Data Management



Conquering Chaos in the Age of Networked Science:

Research Data Management*

*Adaptation of the NECDMC First Module

Kathryn M. Houk, MLIS

Tufts University Hirsh Health Sciences Library

Wednesday, June 4, 2014

Librarians: Your Partners in Research


Today’s Objectives

• Recognize what research data is and what data management entails

• Recognize why managing data is important

• Identify common data management issues

• Learn best practices and resources for managing these issues

• Learn how the library can help you identify data management resources, tools, and best practices


What is Data?

• “Research data, unlike other types of information, is collected, observed, or created, for purposes of analysis to produce original research results” (University of Edinburgh).

• Observational

• Experimental

• Simulation data

• Derived or compiled data


Why Should I Manage it?

• Transparency & Integrity

• Compliance


Science & Personal Benefits

• Who uses your data now?

• Who COULD use your data?

• Shared/Open Data

• Scientific progress

• Impact on your career

• Citation counts


What if I Don’t Consider RDM?

Data Sharing and Management Snafu in 3 Short Acts:

A data management horror story by Karen Hanson, Alisa Surkis and Karen Yacobucci.

http://www.youtube.com/watch?v=N2zK3sAtr-4


Data Management Planning vs. a DMP


Data Management Plans

• What types of data will be created?

• Who will own, have access to, and be responsible for managing these data?

• What equipment and methods will be used to capture and process data?

• Where will data be stored during and after the project?


Simplified Data Management Plan

1. Types of data

• What types of data will you be creating or capturing? (experimental measures, observational or qualitative, model simulation, existing)

• How will you capture, create, and/or process the data? (Identify instruments, software, imaging, etc. used)

2. Contextual Details (Metadata) Needed to Make Data Meaningful to others

• What file formats and naming conventions will you be using?

3. Storage, Backup and Security

• Where and on what media will you store the data?

• What is your backup plan for the data?

• How will you manage data security?

4. Provisions for Protection/Privacy

• How are you addressing any ethical or privacy issues (IRB, anonymization of data)?

• Who will own any copyright or intellectual property rights to the data?

5. Policies for re-use

• What restrictions need to be placed on re-use of your data?

6. Policies for access and sharing

• What is the process for gaining access to your data?

7. Plan for archiving and preservation of access

• What is your long-term plan for preservation and maintenance of the data?


Creating a DMP & Considering Long-Term DM Issues

• Read the case study provided

• Your group is assigned a set of questions (labeled Group 1-6) to answer as best you can

• The first set of questions is from one section of the simplified DMP

• The second set of questions highlights an issue that arises in day-to-day or long-term management of research data (a more detailed level)

• Elect a group speaker

• Each group will discuss their answers

• We will go over the issue associated with your section, common problems, and best practices


Group 1

• DMP Section 1: Types of Data

1. What types (e.g. images, lists of readings, text documents) of data are being collected for this study?

2. What analytical methods and tools are being used in this study?

3. What types of data will be generated from these analytical tools and methods?

• Detailed Planning

1. What naming conventions are being used in the lab?

2. Is there a structure for saving files in the lab?

3. What kind of information would you include in a naming convention for files?

4. What kinds of things would you avoid in naming/labeling files?


Issue: Records Management

• Does this sound familiar?

• Inconsistently labeled files

• in multiple versions…

• inside poorly structured folders…

• stored on multiple media…

• in multiple locations…

• and in various formats…


Issue: Records Management

• Best Practices:

• Avoid special characters in a file name.

• Use capitals or underscores instead of periods or spaces.

• Use 25 or fewer characters.

• Use documented & standardized descriptive information about the project/experiment.

• Use the ISO 8601 date format: YYYYMMDD.

• Include a version number.
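The naming rules above can be sketched as a small helper. This is only an illustration and not part of the original slides: the function names `build_filename` and `is_valid_filename`, the field order (project, description, date, version), the choice to apply the 25-character limit to the name without its extension, and the example values are all assumptions.

```python
import re
from datetime import date

# "Use 25 or fewer characters"; applying it to the name without the
# extension is an assumption made for this sketch.
MAX_STEM_LENGTH = 25

def build_filename(project, description, when, version, ext):
    """Assemble a name like MouseWt_Raw_20140604_v02.csv (hypothetical layout)."""
    stem = f"{project}_{description}_{when:%Y%m%d}_v{version:02d}"
    return f"{stem}.{ext}"

def is_valid_filename(name):
    """Check a file name against the naming rules listed above."""
    stem, _, ext = name.rpartition(".")
    if not stem or not ext:
        return False
    # Letters, digits, and underscores only: no spaces, periods, or special characters.
    if not re.fullmatch(r"[A-Za-z0-9_]+", stem):
        return False
    if len(stem) > MAX_STEM_LENGTH:
        return False
    # Require an 8-digit YYYYMMDD-style block and a trailing version tag such as _v02.
    has_date = re.search(r"(?<!\d)\d{8}(?!\d)", stem) is not None
    has_version = re.search(r"_v\d+$", stem) is not None
    return has_date and has_version

name = build_filename("MouseWt", "Raw", date(2014, 6, 4), 2, "csv")
print(name, is_valid_filename(name))
```

A lab could run a check like this over a shared folder to spot files that drift from the agreed convention. The date check only looks for eight digits; it does not confirm that they form a real calendar date.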



Group 2

• DMP Section 2: Contextual Details (Metadata)

1. What contextual details would the researcher need to document to make her data meaningful to others?

2. How would a lack of naming and labeling conventions impact later data access by other researchers and possibly herself?

• Detailed Planning

1. What general information do you think is needed for scientific data to make it discoverable? (ex. Think of a search screen and a dropdown menu of where you can search for a term: Title, Author, Genre, etc.)

2. Are you aware of any metadata standards for the life or health sciences?

3. Do you think all metadata has to be hand-entered or recorded?

4. How would you ensure lab members knew to collect and record specific information in standard ways?


Issue: Metadata

• How will someone make sense of your data, e.g. the cells and values of your spreadsheet?

• What universal or disciplinary standards could be used to label your data?

• How can you describe a data set to make it discoverable?


Issue: Metadata

• Biology and health-specific metadata examples


Issue: Metadata

• Title

• Creator

• Identifier

• Subject

• Funders

• Rights

• Access information

• Language

• Dates

• Location

• Methodology

• Data processing

• Sources

• List of file names

• File Formats

• File structure

• Variable list

• Code lists

• Versions

• Checksums
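Two items in this list, the list of file names and the checksums, can be generated automatically rather than typed by hand. The sketch below uses SHA-256 from Python's standard `hashlib` module; the helper names `sha256_of` and `checksum_manifest` and the demo file `readings.csv` are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path

def sha256_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def checksum_manifest(folder):
    """Map each file's path (relative to folder) to its checksum."""
    folder = Path(folder)
    return {
        str(p.relative_to(folder)): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }

# Demo on a throwaway folder holding one made-up data file.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "readings.csv").write_bytes(b"id,value\n1,3.2\n")
    manifest = checksum_manifest(tmp)
    for file_name, digest in manifest.items():
        print(digest, file_name)
```

Recomputing the manifest later and comparing it with the stored one shows whether any file has changed or been corrupted.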


Issue: Metadata

• Best Practices

• Describe the contents of data files

• Define each parameter and its units

• Explain the formats for dates, time, geographic coordinates, and other parameters

• Define any coded values

• Describe quality flags or qualifying values

• Define missing values
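One way to follow these practices is to keep a small data dictionary next to the data file. The sketch below is an assumption-laden example, not a metadata standard: the file name, variable names, units, codes, and the `-999` missing-value marker are all invented for illustration, as is the helper `undocumented_columns`.

```python
import json

# Hypothetical data dictionary for a made-up file, readings.csv.
data_dictionary = {
    "file": "readings.csv",
    "contents": "Weekly body-weight readings, one row per animal per visit",
    "variables": {
        "visit_date": {"description": "Date of visit", "format": "YYYYMMDD"},
        "weight": {"description": "Body weight", "units": "g", "missing_value": -999},
        "group": {
            "description": "Treatment group",
            "codes": {"0": "control", "1": "treated"},
        },
        "qc_flag": {
            "description": "Quality flag",
            "codes": {"0": "good", "1": "suspect"},
        },
    },
}

def undocumented_columns(header, dictionary):
    """Return the column names that have no entry in the data dictionary."""
    return [column for column in header if column not in dictionary["variables"]]

header = ["visit_date", "weight", "group", "qc_flag", "notes"]
print(undocumented_columns(header, data_dictionary))
print(json.dumps(data_dictionary, indent=2))
```

Saving the dictionary as JSON keeps it in an open, plain-text format that can travel with the data.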


Group 3

• DMP Section 3: Data Backup, Storage, and Security

1. Where and on what media will the data from each source be stored?

2. How, how often, and where will the data be backed up?

3. Are there any security concerns for the data and have they been addressed?

• Detailed Planning

1. How many copies of your data do you think you should have and where should you keep them?

2. Is there any group on campus you think could help you with backup and security/access concerns?

3. What are some good data storage and backup practices you know about or practice?


Issue: Backup & Security

• How often should data be backed up?

• How many copies of data should you have?

• Where can you store your data?

• How much server space can I get?


Issue: Backup & Security

• Best Practices

• Make 3 copies (original + external/local + external/remote)

• Have them geographically distributed (local vs. remote)

• Use a hard drive (e.g. Vista backup, Mac Time Machine, UNIX rsync) or tape backup system

• Cloud storage: some examples of private sector storage resources include Amazon S3, Elephant Drive, Jungle Disk, Mozy, and Carbonite

• Unencrypted is ideal for storing your data because it will make it most easily read by you and others in the future… but if you do need to encrypt your data because of human subjects, then:

• Keep passwords and keys on paper (2 copies), and in a PGP (Pretty Good Privacy) encrypted digital file

• Uncompressed is also ideal for storage, but if you need to compress to conserve space, limit compression to your 3rd backup copy
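The "3 copies" rule can be sketched as a copy-and-verify step. This is a minimal illustration under stated assumptions: the function `back_up` is hypothetical, and the temporary folders `external_local` and `external_remote` merely stand in for a real external drive and a real remote location.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256_of(path):
    """Return the SHA-256 hex digest of a (small) file."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def back_up(original, destinations):
    """Copy the original into each destination folder and verify every copy."""
    original = Path(original)
    expected = sha256_of(original)
    copies = []
    for folder in destinations:
        folder = Path(folder)
        folder.mkdir(parents=True, exist_ok=True)
        copy = folder / original.name
        shutil.copy2(original, copy)  # copy2 also preserves timestamps
        if sha256_of(copy) != expected:
            raise IOError(f"Backup at {copy} does not match the original")
        copies.append(copy)
    return copies

# Demo: temporary folders play the roles of the local and remote backups.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    original = root / "lab" / "readings.csv"
    original.parent.mkdir()
    original.write_bytes(b"id,value\n1,3.2\n")
    copies = back_up(original, [root / "external_local", root / "external_remote"])
    total_copies = 1 + len(copies)  # the original plus two verified backups
    print(total_copies, "copies verified")
```

Verifying each copy by checksum matters because a backup that was never checked may turn out to be unreadable when it is finally needed.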


Group 4

• DMP Sections 4. Data protection/privacy and 5. Policies for reuse of data

1. How is the lab addressing any privacy or ethical issues?

2. Who will own any copyright or intellectual property rights to the data?

3. Are there any restrictions to the reuse of the data?

• Detailed Planning

1. Are there any reasons to not share or reuse data? Are these ethical or cultural issues?

2. Will having public funding affect data sharing and reuse differently than having private funding?

3. Who has the right to make decisions about reuse of your data?


Issue: Ownership & Retention

• Intellectual Property Policy

• IRB data retention policy

• Funders’ data retention policy

• Publishers’ data retention policy

• Federal and State laws


Issue: Ownership & Retention

• How long is long enough?


Issue: Ownership & Retention

• IRB OHRP Requirements: 45 CFR 46 requires research records to be retained for at least 3 years after the completion of the research.

• HIPAA Requirements: Any research that involved collecting identifiable health information is subject to HIPAA requirements. As a result, records must be retained for a minimum of 6 years after each subject signed an authorization.

• FDA Requirements (21 CFR 312.62(c)): Any research that involved drugs, devices, or biologics being tested in humans must have records retained for a period of 2 years following the date a marketing application is approved for the drug for the indication for which it is being investigated; or, if no application is to be filed or if the application is not approved for such indication, until 2 years after the investigation is discontinued and FDA is notified.

• VA Requirements: At present, records for any research that involves the VA must be retained indefinitely per VA federal regulatory requirements.

• Intellectual Property Requirements: Any research data used to support a patent must be retained for the life of the patent in accordance with Intellectual Property Policy.

• Check your funder's and publisher's requirements.

• Questions of data validity: If there are questions or allegations about the validity of the data or appropriate conduct of the research, you must retain all of the original research data until such questions or allegations have been completely resolved.
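When several of the fixed-term rules apply to the same study, the records have to be kept until the latest of the resulting dates. The sketch below works through that arithmetic for the 3-year, 6-year, and 2-year rules only. The function names and all dates are invented, and the sketch assumes that no indefinite requirement (VA), patent, funder or publisher rule, or open question about data validity applies.

```python
from datetime import date

def add_years(d, years):
    """Shift a date by whole years, mapping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:  # Feb 29 and the target year is not a leap year
        return d.replace(year=d.year + years, day=28)

def earliest_disposal_date(research_completed,
                           last_hipaa_authorization=None,
                           fda_event=None):
    """Return the latest date required by the fixed-term rules that apply.

    fda_event is the date a marketing application was approved, or the date
    the investigation was discontinued and FDA was notified.
    """
    candidates = [add_years(research_completed, 3)]  # 3 years after completion
    if last_hipaa_authorization is not None:
        # 6 years after each authorization, so the last one signed governs.
        candidates.append(add_years(last_hipaa_authorization, 6))
    if fda_event is not None:
        candidates.append(add_years(fda_event, 2))  # 2 years after the FDA event
    return max(candidates)

# Made-up dates for illustration.
keep_until = earliest_disposal_date(
    research_completed=date(2014, 6, 4),
    last_hipaa_authorization=date(2012, 3, 1),
)
print(keep_until)
```

In this example the HIPAA rule governs, because 6 years after the last authorization falls later than 3 years after completion.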


Group 5

• DMP Section 6: Policies for access and sharing

1. How will others be able to gain future access to the study data?

2. How does the graduate student plan to link her datasets to her published article?

• Detailed Planning

1. Could there be a use for the graduate student’s data that was not used in the published article?

2. Are the data the student collected in open or proprietary formats (will people need specialized software to access and interpret the data)?

a) How would this affect future accessibility & reuse?


Group 6

• DMP Section 7: Plan for archiving and preservation of access

1. What is the long-term strategy for maintaining, curating and archiving the data?

2. Where will the data be stored?

3. What contextual data (data that describes your data) or other related data will be included in the archive?

• Detailed Planning

1. What data should be included in an archive?

2. Do you know of any data repositories that you could use for your data?

3. How can you ensure that your data is discoverable and interpretable?

4. How long should the data be maintained? What factors affect the length of time you retain your data?


Issue: Long-Term Planning

• What will happen to my data after my project ends?

• How can I appraise the value of my data?

• What are my options for archiving and preserving my data?

• What are my options for publishing and sharing data?


Data Formats

• Is the file format open (i.e. open source) or closed (i.e. proprietary)?

• Is a particular software package required to read and work with the data file? If so, the software package, version, and operating system platform should be cited in the metadata.

• Do multiple files comprise the data file structure? If so, that should be specified in the metadata.
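The software and file-structure details mentioned above could be recorded in a small metadata record, sketched below. The package name `ImageAnalyzer`, its version, the file names, and the field names are all hypothetical; the operating system details come from Python's standard `platform` module and describe the machine the script runs on.

```python
import json
import platform

def software_metadata(package_name, package_version):
    """Describe the software and operating system needed to read a data file."""
    return {
        "software_package": package_name,
        "software_version": package_version,
        "operating_system": platform.system(),  # e.g. "Linux", "Windows", "Darwin"
        "os_release": platform.release(),
    }

# Hypothetical record for a data set made up of two related files.
record = {
    "files": ["scan_001.tif", "scan_001_header.txt"],
    "required_software": software_metadata("ImageAnalyzer", "2.1"),
}
print(json.dumps(record, indent=2))
```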


Open vs. Proprietary Formats Used in Research Labs


Issue: Long-Term Planning

• Best Practices

• When choosing a file format, select a consistent format that can be read well into the future and is independent of changes in applications.

• Non-proprietary: open, documented-standard, unencrypted, uncompressed, ASCII-formatted files will be readable into the future.


Issue: Long-Term Planning

• Librarians can help:

• Identify file formats suitable for long-term preservation

• Interpret your funder or publisher’s repository requirements

• Find and evaluate a suitable repository for your data

• Upload your data sets to a repository

• Help make your data in a repository searchable and discoverable

• Create a DOI or other persistent identifier

• Choose metadata standards for increased discoverability


Issue: Data Stewardship

• Challenges

• Team Science

• Managing Laboratory Notebooks

• Rotating Lab Personnel


Issue: Data Stewardship

• Best Practices

• Define roles and assign responsibilities for data management

• Identify skills needed to perform tasks outlined in DMP and match to available staff

• Develop training plans for continuity

• Assign responsible parties and monitor results


How the Library Can Help:

• Teach you, your lab, or your classes about data management best practices

• Write a data management and/or sharing plan

• Comply with federal, funder, and publisher data sharing policies

• Find & submit your data to a repository

• Find standards to describe & label your data & data files

• Find a data set

• Cite others’ data

• Publish a data set

• Get a DOI for a data set

• Measure the citation impact of your data set

• Build a collection of research data that others can search & access

• Archive & preserve your data

• Learn about copyright & license issues surrounding your data


Find Help

• Ask your librarian if the library can help!

• Make it known you are interested in receiving assistance from the library

• Ask your IT department about the storage and security options available

• Let them help you make a backup and storage plan


Learn More

• Data Management Principles & Education:

• Research Data MANTRA

• DataONE: Best Practices

• UK Data Archive

• MIT Data Management and Publishing Guide

• Data Management Plans

• Digital Curation Centre

• DMPTool2

• DataONE: Data Management Planning


Works Cited

Lamar Soutter Library, University of Massachusetts Medical School. 2014. “New England Collaborative Data Management Curriculum: Module 1.” http://library.umassmed.edu/necdmc.

DataONE. 2013. “Best Practices for Data Management.” http://www.dataone.org/best-practices.

MIT Libraries. 2013. “Data Management and Publishing.” MIT. http://libraries.mit.edu/guides/subjects/data-management/index.html.

Office of Research Integrity. 2013. “Data Management.” United States Department of Health and Human Services. United States Federal Government. http://ori.hhs.gov/education/products/rcradmin/topics/data/open.shtml.

Special thanks to Jen Ferguson, Richard Moore and Glenn Gaudette for permission to use their slides.