Poster RDAP13: A Workflow for Depositing to a Research Data Repository: A Case Study for Archiving...



Betsy Gunia, David Fearon, Benjamin Brosius, Tim DiLauro
JHU Data Management Services, Johns Hopkins University Sheridan Libraries
A Workflow for Depositing to a Research Data Repository: A Case Study for Archiving Publication Data
Research Data Access & Preservation Summit 2013, Baltimore, MD, April 4, 2013
#rdap13


An Example Workflow for Depositing to a Research Data Repository: A Case Study for Archiving Publication Data

Betsy Gunia, David Fearon, Benjamin Brosius, Tim DiLauro | JHU Data Management Services | Johns Hopkins University Sheridan Libraries

Data

•Pilot project with two graduating doctoral students

•Biomedical engineering field; largely image data

•Data already published, which differs from our usual service model of working with researchers at the beginning of their project

JHU Data Archive

•Used alpha release of Data Conservancy software [1]

•Discipline-agnostic; treats data as primary objects

•A collection of data may have an associated metadata file, structured or unstructured

•Not yet publicly accessible

Understanding Research

•Met with students for initial overview of research

•Read publications to map data products and the activities that created them

•The resulting map (Fig. 1) provided a framework to organize data and ensure that all data were included (students could not locate all their data)
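The mapping step above can be sketched in code. The following is a minimal illustration, not the tool used in the pilot: it records which data products a publication figure depends on and the activity that produced each, then lists the products that have not yet been located. The class name, figure labels, product names, and activities are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DataProduct:
    name: str              # hypothetical label for a data product
    activity: str          # research activity that produced it
    located: bool = False  # set to True once the files have been found


def missing_products(data_map):
    """Return {figure: [product names]} for products not yet located."""
    missing = {}
    for figure, products in data_map.items():
        names = [p.name for p in products if not p.located]
        if names:
            missing[figure] = names
    return missing


# Hypothetical map from publication figures to the data behind them.
data_map = {
    "pub_figure_3": [
        DataProduct("raw images", "imaging session", located=True),
        DataProduct("segmentation masks", "image analysis"),
    ],
    "pub_figure_4": [
        DataProduct("summary table", "statistical analysis", located=True),
    ],
}

print(missing_products(data_map))
```

A list like this gives a concrete agenda for the next meeting with the researchers: each entry is a data product named in the publication that still has to be found.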

Organizing Data

•Completed several in-depth meetings with students

•Created new folders and subfolders with students present, and moved files to appropriate location

•Discussed data content, instrument(s) used, and file naming conventions used, if any

•Experimented with directory structures based on publication figures or research methods. Students and advisor decided that organizing by figure was more useful for data reuse

•Did not rename files due to time constraints and lack of consistency in filenames
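Because files were moved into new folders but not renamed, one way to confirm that nothing was lost or altered during reorganization is to compare a checksum inventory taken before the move with one taken afterwards. The sketch below is an assumption about how such a check could look, not a description of what was done in the pilot; the function names are hypothetical and MD5 is used only to match the checksum algorithm named in the Packaging section.

```python
import hashlib
from collections import Counter
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def inventory(root):
    """Count (filename, md5) pairs under root, ignoring folder location."""
    return Counter(
        (p.name, md5_of(p)) for p in Path(root).rglob("*") if p.is_file()
    )


def compare(before, after):
    """Return (lost, unexpected) entries between two inventories.

    lost: entries present before the reorganization but not after.
    unexpected: entries present after but not before.
    """
    lost = sorted((before - after).elements())
    unexpected = sorted((after - before).elements())
    return lost, unexpected
```

Taking `inventory()` of the original directory before reorganizing and of the new structure afterwards, `compare()` should return two empty lists if every file was moved intact. A file whose content changed shows up once in each list, under the same name with different checksums.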

Packaging

•Used BagIt (v. 0.97) and TAR as the packaging format

•Used MD5 checksums for data (payload) and tag files

•Created a documentation folder for our unstructured metadata (Fig. 2), which we treated as tag files rather than as part of the payload

•One “bag” per publication
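The packaging choices above can be illustrated with a short sketch that uses only the Python standard library. It assumes a directory that already holds a `data/` payload folder and a `documentation/` folder, writes the BagIt declaration and the MD5 payload and tag manifests, and serializes the result as a TAR file. The function names and folder names are assumptions for illustration, and this is not the tooling used in the pilot.

```python
import hashlib
import tarfile
from pathlib import Path


def md5_of(path):
    """Compute the MD5 hex digest of a file."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(bag_dir, manifest_name, files):
    """Write one 'checksum  relative/path' line per file."""
    lines = [
        f"{md5_of(f)}  {f.relative_to(bag_dir).as_posix()}\n"
        for f in sorted(files)
    ]
    (bag_dir / manifest_name).write_text("".join(lines), encoding="utf-8")


def make_bag(bag_dir):
    """Add BagIt tag files to bag_dir and serialize it as a TAR file."""
    bag_dir = Path(bag_dir)
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    # Payload: every file under data/
    payload = [p for p in (bag_dir / "data").rglob("*") if p.is_file()]
    write_manifest(bag_dir, "manifest-md5.txt", payload)
    # Tag files: everything outside data/ (including documentation/),
    # except the tag manifest itself
    tags = [
        p for p in bag_dir.rglob("*")
        if p.is_file()
        and p.relative_to(bag_dir).parts[0] != "data"
        and p.relative_to(bag_dir).as_posix() != "tagmanifest-md5.txt"
    ]
    write_manifest(bag_dir, "tagmanifest-md5.txt", tags)
    # One TAR file per bag, written next to the bag directory
    tar_path = bag_dir.parent / (bag_dir.name + ".tar")
    with tarfile.open(tar_path, "w") as tar:
        tar.add(bag_dir, arcname=bag_dir.name)
    return tar_path
```

Calling `make_bag()` once per publication directory would yield one bag per publication. Listing the documentation files in `tagmanifest-md5.txt` rather than `manifest-md5.txt` mirrors the decision to treat them as tag files rather than payload.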

Conclusions

•Unsurprisingly, it is hard for researchers to recall information about their data after a few years. This pilot project reinforced the importance of working with scientists early in their research, which is our usual service model.

•Because of time constraints and the limits of student recollection, our metadata creation was limited to folder and file documentation (Fig. 2).

•Closely reading and mapping the students' research was central to being able to ask them relevant questions about the data.

•The BagIt specification worked well for packaging.

Future Work

This pilot project began the process of formalizing our archiving processes, but we have much more to do! The Data Conservancy software will have improved functionality over the coming years, which will affect how we evolve our archiving process. For example, we currently cannot hide deposited data in the JHU Data Archive; however, researchers may want to transfer data to us before their project is complete and ready for public access. We also need to develop rigorous processes for ensuring that we maintain the integrity of the data during the often significant alterations required to make archived datasets useful to others.

Figure 1. Example of data flow diagram

Figure 2. Example of unstructured metadata: folder and file documentation

[1] http://dataconservancy.org/software/

Copyright © 2013, by JHU Data Management Services