Publishing Data Workflows RDA Plenary 5 -- March 11, 2015 Session Chairs: Amy Nurnberger and Mary...

Post on 25-Dec-2015


Publishing Data Workflows

RDA Plenary 5 -- March 11, 2015

Session Chairs: Amy Nurnberger and Mary Vardigan

Please sign in: http://bit.ly/1Hju0LM


Agenda
• Introduction:
  • Objectives
  • Progress so far
  • Workflow examples
  • Get involved
• Dataverse workflow presentation
• SoftwareX workflow presentation
• Use case development

Group notes document: http://bit.ly/1MlXysR

The working group members (currently)
• Theodora Bloom (BMJ) [CO-CHAIR]
• Sünje Dallmeier-Tiessen (Switzerland, CERN) [CO-CHAIR]
• Elizabeth Newbold (BL) [CO-CHAIR]
• Merce Crosas (US, Harvard University)
• Michael Diepenbroek (PANGAEA)
• Kim Finney (Australia, AADC)
• John Helly (US, UCSD)
• Brian Hole (Ubiquity Press, UK)
• Varsha Khodiyar (Nature Scientific Data)
• Hylke Koers (The Netherlands, Elsevier)
• Rebecca Lawrence (UK, F1000 Research Ltd.)
• Fiona Murphy (UK, Wiley-Blackwell)
• Amy Nurnberger (US, Columbia University Libraries)
• Lisa Raymond (US, Library Woods Hole Oceanographic Institution)
• Johanna Schwarz (Germany, Springer)
• Jonathan Tedds (UK, University of Leicester)
• Mary Vardigan (US, ICPSR)
• Ruth Wilson (UK, Nature)
• Eva Zanzerkia (US, NSF)
• Angus Whyte (UK, DCC)
• And growing…

Others are very welcome ☺

Background and Motivation
• Only a small fraction of research data is preserved and shared, often with a bare minimum of metadata
• Often due to the lack of “established” or “trusted” services and workflows

But there are established or emerging workflows!
• Usually in selected disciplines, e.g., Earth Sciences
• Some provide credit via citation mechanisms

Objectives
• Provide an analysis of a representative range of existing and emerging workflows and standards for data publishing
  • Including deposit and citation
• Provide reference models, a “classification”
• Test implementations of key components for application in new workflows
• Illustrate the benefits of the reference models for researchers and organisations

Relevance
• Information about workflows is crucial for researchers and other stakeholders to understand the options available to practice open science
• Helps to illustrate different possibilities for data sharing, leading to more efficient and reliable reuse of research data
• Shows those involved in research data where they fit in the overall scheme of things

More detailed work programme
• Identification of a smaller set of reference models covering a range of such workflows to include:
  • For example, when and where QA/QC and data peer-review fit into the publishing process
  • Who does what and when…
  • Automated vs. “manual” processes
• Selection of key use cases and organizations in which components of a reference model can be implemented and tested for suitability
  • For example: dedicated data peer review
  • For example: metadata checks

First results of workflow analysis

http://tinyurl.com/mvtbrek

Workflows in the current list
- STFC Data centre
- NSIDC Data centre
- ENVRI reference model
- OJS / Dataverse
- INSPIRE Digital library
- NPG (PubChem & Scientific Data) Publisher
- UK Data Archive/Service
- PREPARDE (NCAR CISL)
- Ocean Data Publication Cookbook (UNESCO IOC)
- PURR Institutional repository
- ICPSR
- Edinburgh Datashare
- F1000 Research
- Ubiquity Press: Open Health Data Journal+...
- PANGAEA - Data Publisher for Earth and Environmental Sciences
- WDC Climate - Data Publisher for Climate Sciences
- CMIP / IPCC DDC - International project series in Climate Sciences
- GigaScience
- Dryad digital repository with integrated journals workflow
- Stanford Digital Repository
- Academic Commons: Columbia University Institutional Research Repository
- Elsevier: Data in Brief
- Integrated data publishing solution at Elsevier [through “traditional” journals]

Categories we are looking at
• Discipline
• Function of workflow
• PID assignment to dataset
• PID type -- e.g., DOI, ARK, etc.
• Peer review of data (e.g., by researcher & editorial review)
• Curatorial review of metadata (e.g., by institutional or subject repository?)
• Technical review & checks (e.g., for data integrity at repository/data centre on ingest)
• Discoverability: Indexing of the data -- if yes, where?
• Formats covered
• Persons/Roles involved, e.g., editor, publisher, data repository manager, etc.
• Link to data paper or “standalone” data
• Links to grants, usage of author PIDs
• Data citation facilitated
• Data life cycle referred to
• Standards compliance
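As an illustration only, the categories above could be captured as one record per workflow so that entries can be compared and tallied. The sketch below covers a subset of the categories; the class name `WorkflowRecord`, the field names, and the two sample entries are hypothetical and are not taken from the working group's spreadsheet.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class WorkflowRecord:
    """One analysed workflow, described by a subset of the categories."""
    name: str
    discipline: str
    function: str
    pid_assigned: bool
    pid_type: Optional[str] = None          # e.g. "DOI" or "ARK"
    peer_review: bool = False               # peer review of data
    curatorial_review: bool = False         # curatorial review of metadata
    technical_review: bool = False          # technical checks on ingest
    indexed_in: List[str] = field(default_factory=list)
    roles: List[str] = field(default_factory=list)


def count_with(records, attribute):
    """Count the records for which a boolean category is true."""
    return sum(1 for record in records if getattr(record, attribute))


# Hypothetical sample entries, invented for this sketch.
records = [
    WorkflowRecord("Example subject repository", "Earth Sciences",
                   "data publication", pid_assigned=True, pid_type="DOI",
                   curatorial_review=True, technical_review=True),
    WorkflowRecord("Example institutional repository", "multi-disciplinary",
                   "data deposit", pid_assigned=True, pid_type="ARK"),
]

print(count_with(records, "pid_assigned"))       # 2
print(count_with(records, "curatorial_review"))  # 1
```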

Observations
• The researcher/author generally initiates the workflow
• Discipline-specific repositories have the most rigorous ingest and review processes -- more general institutional repositories have a lighter touch
• Journals vs. repositories: For the former, any peer review is conducted externally; for many of the latter it is internal

Repository view

[Figure: Simplified generic repository workflow. Stages shown between Producer and Consumer/Reuse: Data Deposit, Ingest, Quality Assurance, Data Management and LT Archiving, Dissemination and Access.]

Researcher with a central role: submission/deposition

Review/QA mainly internal

[Figure, designed by M. Stockhause: two-tier repository workflow. Stages shown for the project repositories, between Producer and Consumer (disciplinary): Data Deposit, Ingest, Quality Assurance (light), Data Management and LT Archiving, Dissemination and Access. Stages shown for the long-term archive, leading to Consumer (interdisciplinary): Ingest, Quality Assurance (detailed), Dissemination and Access.]

Project Repositories:
• Data are published in a federated data infrastructure
• Data are added and corrected
• Poor documentation
• Usually no data backup
• Light-weight quality assurance against intl. and project standards
• Tendency that the project data never become stable
• Currently no PIDs assigned or reserved but Handles planned

Long-term Archive:
• Data are archived for the long term at a single location
• Data are stable and curated
• Detailed documentation
• Data backup/redundancy
• Quality assurance process is more detailed and includes a review
• Data is a “snapshot” of the project data at a certain time
• DOIs assigned to data collections

Lessons Learnt and questions
• Very diverse landscape
• Discipline-specific and cross-discipline actions
• Quality assurance a big topic in discipline-specific repositories
• Widespread persistent identification
• Data citation awareness
• Challenge: Bidirectional data-publication linking
• Challenge: Versioning

Publisher’s perspective

[Figure: Simplified generic publisher workflow. Stages shown between Producer and Consumer/Reuse: Article preparation, Data Submission, Article submission, Peer Review Process, Editing, Publishing.]

Researcher takes over several roles: submitter, reviewer, editor potentially?

Who takes on which role and responsibility?
- Article/data container
- Separate article and datasets

Example: Dryad repository integrated with journals

Lessons learnt and questions
• Recommended repositories for collaboration? Who decides/how?
• External review
  • Open, plus invitation
  • Closed, upon invitation
  • Blind
• Emerging data and software journal landscape: no information yet on uptake

Current and future work

How to get involved
• Contribute to the workflow analysis: http://bit.ly/1BBQQPW
• Contribute your own workflow “walk-throughs” and use cases
• Tell us what is needed for a “successful” workflow in your institute/discipline

… Moving to implementation
• Tell us if you are interested in learning from a specific example or are considering implementing data publishing workflows
• Tell us if you have code/documentation to share

Break for presentations
Dataverse: Eleni Castro
SoftwareX: Hylke Koers

DATA PUBLISHING WORKFLOWS WITH DATAVERSE

Eleni Castro (ecastro@fas.harvard.edu)

Institute for Quantitative Social Science (IQSS)

Harvard University

RDA 5th Plenary
WG RDA/WDS Publishing Data Workflows
March 11, 2015

An Integrated & Automated Journal / Data Publishing Workflow

Journal

Repository

Current Workflows in Dataverse: To Connect Data to Journals

A. Journals include Dataverse as a Recommended Repository

B. Authors Contribute Directly to a Journal’s Dataverse

C. Automated Integration of Journal + Dataverse (e.g., OJS)

Example of Option C: Phase 1OJS / Dataverse Integration

Integrating Open Journal Systems (OJS) with Dataverse Reference Implementation: Automated via SWORD API

Pilot with ~ 50 journals + expand to 1000s using OJS. Dataverse plugin is automatically available w/ OJS. Future: Embed Dataverse widgets into journal article.

http://projects.iq.harvard.edu/ojs-dvn

Project Details: 2012-2014

In the Backend: Technical Workflow

Client sends:

XML file: AtomPub “entry” with Dublin Core Terms (e.g., title, creator, isReferencedBy (article citation), …)

Zip file: All data files associated with that dataset.
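A minimal sketch of what the XML part of such a deposit might look like, built with Python's standard library. It only illustrates an Atom entry carrying the three Dublin Core Terms fields named above; the function name `build_atom_entry` and all field values are placeholders, and the exact set of fields a given repository requires is not specified here.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
DCTERMS_NS = "http://purl.org/dc/terms/"

# Use readable prefixes instead of the auto-generated ns0/ns1.
ET.register_namespace("atom", ATOM_NS)
ET.register_namespace("dcterms", DCTERMS_NS)


def build_atom_entry(title, creators, article_citation):
    """Return an Atom entry (as a string) with Dublin Core Terms children."""
    entry = ET.Element(f"{{{ATOM_NS}}}entry")
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}title").text = title
    for name in creators:
        ET.SubElement(entry, f"{{{DCTERMS_NS}}}creator").text = name
    # isReferencedBy carries the citation of the article that uses the data.
    ET.SubElement(entry, f"{{{DCTERMS_NS}}}isReferencedBy").text = article_citation
    return ET.tostring(entry, encoding="unicode")


# Placeholder values, invented for this sketch.
entry_xml = build_atom_entry(
    title="Replication data for an example article",
    creators=["Doe, Jane", "Roe, Richard"],
    article_citation="Doe, J. and Roe, R. (2015). Example article. Example Journal.",
)
print(entry_xml)
```

The zip file with the data files would be sent as a separate request; that step is left out of the sketch.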

Repository sends:

XML file: “Deposit Receipt” that sends the data citation from the repository to the client.

Plus updates from client to server during lifecycle (CRUD): In review, reject (delete), publish first version, update new versions.
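The lifecycle updates listed above can be read as a small state machine. The sketch below is one possible reading, not the plugin's actual implementation: the state names, the action names ("reject", "publish", "update"), and the version counter are assumptions made for illustration.

```python
from enum import Enum


class DepositState(Enum):
    IN_REVIEW = "in review"
    REJECTED = "rejected (deleted)"
    PUBLISHED = "published"


# Allowed transitions, read off the list of lifecycle updates (assumed).
TRANSITIONS = {
    (DepositState.IN_REVIEW, "reject"): DepositState.REJECTED,
    (DepositState.IN_REVIEW, "publish"): DepositState.PUBLISHED,
    (DepositState.PUBLISHED, "update"): DepositState.PUBLISHED,
}


class Deposit:
    """Tracks the state and version number of one deposited dataset."""

    def __init__(self):
        self.state = DepositState.IN_REVIEW
        self.version = 0  # no published version yet

    def apply(self, action):
        key = (self.state, action)
        if key not in TRANSITIONS:
            raise ValueError(f"cannot {action!r} while {self.state.value}")
        self.state = TRANSITIONS[key]
        if action in ("publish", "update"):
            self.version += 1  # first version on publish, new one per update
        return self.state


deposit = Deposit()
deposit.apply("publish")  # editor approves: first version is published
deposit.apply("update")   # author later releases a new version
print(deposit.state.value, deposit.version)  # published 2
```

A rejected deposit has no outgoing transitions here, which matches the "reject (delete)" wording: the dataset is removed rather than revised.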

On the Frontend: OJS Dataverse Plugin Walkthrough

Journal Manager Sets Up Plugin in OJS

Journal Manager Sets Up Data Policies

Read full Data Policies / Guidelines Template: http://bit.ly/1xkLjoZ

Including Guidelines for:
1) Authors (data citation)
2) Reviewers
3) Copyeditors

Author Submits Manuscript + Data (1)

Author Submits Manuscript + Data (2)

Option to: (a) deposit into Dataverse OR; (b) if data is already in a repository can include the data citation (w/ persistent URL/identifier).

To-Do: Support for adding multiple datasets to a journal article.

Editor Reviews Article + Data

Approved = Data Published in Dataverse

When issue is published:
1) URL to Article displays in Dataverse.
2) Data Citation shows up in OJS Article (see next slide).

Article in OJS: Published w/ Data Citation

Video of OJS Dataverse Plugin Demo

http://bit.ly/1D1hphu

Phase 2: Expansion of API + Workflows

2015-2016 (collaboration w/ Odum Institute)

Project Goals
1. Expand to more journals, publishing systems, & workflows
2. Develop Community-Based Repository API Standard: Work w/ RDA, WDS, Data FAIRport, FORCE11, CODATA, etc…

Project Questions
• Should we extend the Repository API beyond SWORD?
• Support for additional Metadata Schemas & fields (non-DC)?
• Support for more/which dataset review workflows?

How Do I Get Involved?

Sign up to Contribute: Repositories Workshop + Dataverse Community Meeting June 9-11, 2015 @ Harvard http://bit.ly/1A51atJ

Find Out More: * Visit our Collaborations page: http://bit.ly/1Bg2nkw * Dataverse Project Site: http://dataverse.org

Contact Project Coordinator: Eleni Castro (ecastro@fas.harvard.edu)

Thank You! Any Questions?

Contact Me: Eleni Castro (ecastro@fas.harvard.edu)

Hylke Koers, Head of Content Innovation, Elsevier

RDA Plenary 5, San Diego

SoftwareX – a home for research software

| 42Open Access

Software (like data) is high-value but hard to access

Researcher survey, 3824 respondents(Publishing Research Consortium, 2010)

[Figure: survey responses plotted by importance of access against ease of access, with regions labelled “High value & easy access” and “High value & difficult to access”.]

Why SoftwareX?
• Many scholars develop software, but the current paper-based system does not capture this “born digital” research output systematically
• Users (readers) can’t find this valuable content
• Developers (authors) can’t claim credit
• Software is a research method in its own right and deserves to receive full academic recognition

SoftwareX: a home for research software

SoftwareX aims to acknowledge the impact of software on today's research practice, and on new scientific discoveries in almost all research domains. SoftwareX also aims to stress the importance of the software developers who are, in part, responsible for this impact.

To this end, SoftwareX aims to support publication of research software in such a way that:
• The software is provided with a peer-reviewed recognition of scientific impact;
• The software developers are given the academic credit they deserve;
• The software is citable, allowing traditional metrics of scientific excellence to apply;
• The academic career paths of software developers are supported rather than hindered;
• The software is publicly available for inspection, validation, and re-use.

Above all, SoftwareX aims to inform researchers about software applications, tools and libraries with a (proven) potential to impact the process of scientific discovery in various domains

From “Aims & Scope”, see http://www.journals.elsevier.com/softwarex

SoftwareX: a home for research software

• Publishing “Original Software Publications”:
  - The software and code can include post publication updates
  - Metadata is systematically captured
• Article is Open Access under CC-BY license
• All software and code published is, and will remain, fully owned by their developers.
• Peer-reviewed; dedicated software Editors & Reviewers
• Multi-disciplinary
• Submission in 3 easy steps
• GitHub repository to store and expose all software and code
• Launched at FORCE15

See http://www.journals.elsevier.com/softwarex/news/you-can-now-submit-your-software-to-softwarex/

How does it work?

How to submit your software to SoftwareX in 3 easy steps:

1. Select a repository for your software or pack your software into a zip file or archive. Remember to make your software public so that the reviewers and readers can find it.

2. Download the template for the OSP manuscript, and write your article describing your software following this template.

3. Submit your OSP manuscript via the SoftwareX submission site.

After review and acceptance, software and/or code will be copied to the journal archive on GitHub and integrated with the online version of your Original Software Publication available on ScienceDirect.

See http://www.journals.elsevier.com/softwarex

Template contains structured metadata

Nr | Code metadata description | Please fill in this column
C1 | Current code version | For example v42
C2 | Permanent link to code/repository used of this code version | For example: https://github.com/mozart/mozart2
C3 | Legal Code License | List one of the approved licenses
C4 | Code versioning system used | For example svn, git, mercurial, etc. put none if none
C5 | Software code languages, tools, and services used | For example C++, python, r, MPI, OpenCL, etc.
C6 | Compilation requirements, operating environments & dependencies |
C7 | If available Link to developer documentation/manual | For example: http://mozart.github.io/documentation/
C8 | Support email for questions |
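To show how structured metadata of this kind lends itself to automated checks, here is a small sketch that reports which code metadata fields are still empty. The field labels come from the template; treating only C7 ("If available") as optional, the function name `missing_fields`, and the sample entry are assumptions made for illustration, not the journal's own validation rules.

```python
# Field numbers and labels from the code metadata table.
CODE_METADATA_FIELDS = {
    "C1": "Current code version",
    "C2": "Permanent link to code/repository used of this code version",
    "C3": "Legal Code License",
    "C4": "Code versioning system used",
    "C5": "Software code languages, tools, and services used",
    "C6": "Compilation requirements, operating environments & dependencies",
    "C7": "If available Link to developer documentation/manual",
    "C8": "Support email for questions",
}

# Assumption: only the field marked "If available" may be left empty.
OPTIONAL_FIELDS = {"C7"}


def missing_fields(metadata):
    """Return the numbers of required fields that are absent or blank."""
    return [
        number
        for number in CODE_METADATA_FIELDS
        if number not in OPTIONAL_FIELDS
        and not str(metadata.get(number, "")).strip()
    ]


# A partially filled entry using the template's own example values.
entry = {
    "C1": "v42",
    "C2": "https://github.com/mozart/mozart2",
    "C4": "git",
    "C5": "C++",
}

print(missing_fields(entry))  # ['C3', 'C6', 'C8']
```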

Template contains structured metadata

Nr | (Executable) software metadata description | Please fill in this column
S1 | Current software version | For example 1.1, 2.4 etc.
S2 | Permanent link to executables of this version | For example: https://github.com/combogenomics/DuctApe/releases/tag/DuctApe-0.16.4
S3 | Legal Software License | List one of the approved licenses
S4 | Computing platforms/Operating Systems | For example Android, BSD, iOS, Linux, OS X, Microsoft Windows, Unix-like, IBM z/OS, distributed/web based etc.
S5 | Installation requirements & dependencies |
S6 | If available, link to user manual - if formally published include a reference to the publication in the reference list | For example: http://mozart.github.io/documentation/
S7 | Support email for questions |

Flexible range of open-source licenses for computer code

• Apache License, 2.0 (Apache-2.0)
• BSD 3-Clause "New" or "Revised" license (BSD-3-Clause)
• BSD 2-Clause "Simplified" or "FreeBSD" license (BSD-2-Clause)
• GNU General Public License (GPL)
• GNU Library or "Lesser" General Public License (LGPL)
• MIT license (MIT)
• Mozilla Public License 2.0 (MPL-2.0)
• Common Development and Distribution License (CDDL-1.0)
• Eclipse Public License (EPL-1.0)
• Creative Commons Zero (CC0)
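A metadata field such as "Legal Code License" could be checked against this list mechanically. The sketch below assumes the short identifiers shown in parentheses are the values authors would enter; the function name `is_approved` and the case-insensitive matching are illustrative choices, not part of the journal's submission system.

```python
# Short identifiers as they appear in parentheses on the slide.
APPROVED_LICENSES = {
    "Apache-2.0", "BSD-3-Clause", "BSD-2-Clause", "GPL", "LGPL",
    "MIT", "MPL-2.0", "CDDL-1.0", "EPL-1.0", "CC0",
}

# Lower-cased copy so that matching ignores letter case.
_APPROVED_LOWER = {identifier.lower() for identifier in APPROVED_LICENSES}


def is_approved(license_id):
    """Return True if the identifier matches one of the listed licenses."""
    return license_id.strip().lower() in _APPROVED_LOWER


print(is_approved("MIT"))           # True
print(is_approved(" apache-2.0 "))  # True
print(is_approved("Proprietary"))   # False
```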

And now.. The moment you have all been waiting for…

A workflow diagram

[Figure: Researcher has code and paper; submits to journal as OSP + code (supp. mat.); editorial + peer-review process; code made available on journal GitHub instance; OSP published on ScienceDirect; bi-directional links between the code and the OSP.]

A workflow diagram

[Figure: Code deposited to (or built on) code repository; OSP submitted to journal; OSP linked with code; editorial + peer-review process; code made available on journal GitHub instance; OSP published on ScienceDirect; bi-directional links between the code and the OSP.]

Thank you!

Any questions?

Discussion

Use case development

Developing use cases for workflows
● The tools
  ○ Part A: http://goo.gl/forms/Wkc7KyxvX5
  ○ Part B: http://goo.gl/forms/ZFRrzG6krX
● The process
  ○ Walk through the tools
  ○ Form up in groups
  ○ Generate use cases

The tools: Part A http://goo.gl/forms/Wkc7KyxvX5

Thank you! You have completed Part A of this use case. For the next part, you will be completing multiples of a form, to address each individual actor listed in this use case. Click this to get to Part B: http://goo.gl/forms/ZFRrzG6krX

The tools: Part B http://goo.gl/forms/ZFRrzG6krX

Group up!

● The tools
  ○ Part A: http://goo.gl/forms/Wkc7KyxvX5
  ○ Part B: http://goo.gl/forms/ZFRrzG6krX