Sünje Dallmeier-Tiessen: Research data "publishing": models, roles and responsibilities
Data Publishing Models
Sünje Dallmeier-Tiessen, PhD (CERN, Harvard University)
For the RDA-WDS Data Publishing Workflow Group
June 9th, 2015
Topics
• What is data publishing
• Why do we care about it (today)
• Models in data publishing
• Building blocks
• Information gathered through trusted data publishing
• Relevance and conclusions for today’s workshop
This is work conducted by the RDA-WDS working group on data publishing workflows, co-chaired with Fiona Murphy and Theo Bloom.
Data Publishing … describes the process of making research data and other research objects available on the web so that they can be discovered and referred to in a unique and persistent way. At its best, data publishing takes place through dedicated data repositories and data journals and ensures that the published research objects are well documented, curated, archived for the long term, interoperable, citable and quality assured. Thus, they are reusable and discoverable over the long term.
Examples
Analysis elements

• Discipline, responsible units (i.e. their roles)
• Function of workflow
• PID assignment: DOI, ARK, etc.
• Peer review of data (e.g. by researcher & editorial review)
• Curatorial review of metadata (e.g. by institutional or subject repository?)
• Technical review & checks (e.g. for data integrity at repository upon ingestion)
• Formats covered
• Persons/roles involved, e.g. editor, publisher, data repository manager, etc.
• Links to additional data products (data paper; review documents; other journal articles) or “stand-alone” product
• Links to grants, usage of author PIDs
• Discoverability: indexing of the data -- if yes, where?
• Data citation facilitated
• Data life cycle reference
• Standards compliance
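The analysis dimensions above can be captured as one record per surveyed workflow; a minimal sketch in Python, where the field names are illustrative and not the working group's actual survey schema:

```python
# Hypothetical record for comparing data publishing workflows along the
# analysis dimensions listed above (field names are illustrative).
from dataclasses import dataclass, field

@dataclass
class WorkflowProfile:
    discipline: str
    responsible_units: list          # e.g. ["journal editor", "data repository"]
    pid_scheme: str                  # "DOI", "ARK", "Handle", or "none"
    peer_review_of_data: bool
    curatorial_metadata_review: bool
    technical_checks: bool           # e.g. integrity checks at ingestion
    formats_covered: list = field(default_factory=list)
    linked_products: list = field(default_factory=list)  # data paper, review docs, ...
    indexed_in: list = field(default_factory=list)       # where the data are indexed
    data_citation_facilitated: bool = False

# Example entry for a disciplinary repository workflow:
profile = WorkflowProfile(
    discipline="climate science",
    responsible_units=["project repository", "long-term archive"],
    pid_scheme="DOI",
    peer_review_of_data=False,
    curatorial_metadata_review=True,
    technical_checks=True,
    formats_covered=["NetCDF"],
    data_citation_facilitated=True,
)
print(profile.pid_scheme)  # → DOI
```

Collecting such records side by side is what makes the cross-workflow comparison in the following slides possible.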
Repository’s perspective

Simplified generic repository workflow:

Producer → Data Deposit → Ingest → Quality Assurance → Data Management / LT Archiving → Dissemination / Access → Consumer / Reuse

• Researcher with a central role during submission/deposition
• Review/QA mainly internal, through dedicated curation personnel
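The stages above form a linear pipeline; a toy Python sketch, with stage names taken from the slide and the check inside quality assurance invented for illustration:

```python
# Toy model of the simplified repository workflow: each stage takes and
# returns a "submission" dict. The QA check is an invented placeholder.

def deposit(data, metadata):
    """Producer deposits data together with descriptive metadata."""
    return {"data": data, "metadata": metadata, "status": "deposited"}

def ingest(submission):
    submission["status"] = "ingested"
    return submission

def quality_assurance(submission):
    # e.g. a curator verifies that minimal metadata are present
    required = {"title", "creator"}
    missing = required - submission["metadata"].keys()
    submission["status"] = "rejected" if missing else "accepted"
    return submission

def archive_and_disseminate(submission):
    if submission["status"] == "accepted":
        submission["status"] = "published"  # now accessible to consumers
    return submission

s = deposit(b"...", {"title": "Ocean temperatures", "creator": "A. Miller"})
s = archive_and_disseminate(quality_assurance(ingest(s)))
print(s["status"])  # → published
```

Dropping the `creator` field would leave the submission in the `rejected` state, mirroring the internal curation gate described on the slide.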
Disciplinary repository example (content provided by M. Stockhause)

Two-stage workflow:

Producer → Data Deposit → Ingest → Quality Assurance (light) → Data Management → Consumer (disciplinary)
…then: Ingest → Quality Assurance (detailed) → LT Archiving → Dissemination / Access → Consumer (interdisciplinary)

Project repositories:
• Data are published in a federated data infrastructure
• Data are added and corrected
• Poor documentation
• Usually no data backup
• Light-weight quality assurance against intl. and project standards
• Tendency that the project data never become stable
• Currently no PIDs assigned or reserved, but Handles planned

Long-term archive:
• Data are archived for the long term at a single location
• Data are stable and curated
• Detailed documentation
• Data backup/redundancy
• Quality assurance process is more detailed and includes a review
• Data are a “snapshot” of the project data at a certain time
• DOIs assigned to data collections
Lessons learnt and questions

• Very diverse landscape
• Discipline-specific and cross-discipline actions
• Quality assurance a big topic in discipline-specific repositories
• Widespread persistent identification
• Data citation awareness
• Challenge: versioning
Publisher’s perspective

Simplified generic publisher workflow:

Producer → Article preparation / Data submission → Article submission → Peer review process → Editing → Publishing → Consumer / Reuse

• Researcher takes over several roles: submitter, reviewer, potentially editor?
• Published output: either a combined article/data container, or a separate article and datasets (the datasets held in data repositories)
Example Workflows in Dataverse: Connect Data to Journals
A. Journals include Dataverse as a Recommended Repository
B. Authors Contribute Directly to a Journal’s Dataverse
C. Automated Integration of Journal + Dataverse (e.g., OJS)
Slide by Eleni Castro
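As an illustration of workflow C (automated journal integration), a journal system would create a dataset in a Dataverse collection via the Dataverse native API. A hedged sketch: the endpoint path and `X-Dataverse-key` header follow the public Dataverse API, but the server URL, collection alias, API token, and payload values are placeholders:

```python
# Sketch of a dataset-creation request against the Dataverse native API.
# SERVER, COLLECTION, and API_TOKEN are placeholders, not real endpoints.
import json
import urllib.request

SERVER = "https://dataverse.example.org"   # hypothetical installation
COLLECTION = "myjournal"                   # hypothetical collection alias
API_TOKEN = "xxxxxxxx-xxxx"                # issued per user by the installation

payload = {
    "datasetVersion": {
        "metadataBlocks": {
            "citation": {
                "displayName": "Citation Metadata",
                "fields": [
                    {"typeName": "title", "typeClass": "primitive",
                     "multiple": False,
                     "value": "Replication data for: Example study"},
                ],
            }
        }
    }
}

req = urllib.request.Request(
    f"{SERVER}/api/dataverses/{COLLECTION}/datasets",
    data=json.dumps(payload).encode(),
    headers={"X-Dataverse-key": API_TOKEN, "Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would perform the deposit; omitted here
# because this is an offline sketch.
print(req.get_method(), req.full_url)
```

In the OJS-style integration of workflow C, a call like this is triggered automatically at submission time rather than typed by the author.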
Example: Dryad repository integrated with journals
Slide by T. Bloom
Data publishing building blocks

Basic published product:
• Primary data entry with PID
• Repository entry
• Metadata
• Curation

Add-ons (workflows for more documentation, QA, visibility):
• Parallel data description: a data paper (or link to it); link to the results paper
• Linked and published quality assurance: curation and editing process; peer review; any kind of QA process
• Additional visibility: push to ORCID, author pages, impact/reputation-building tools; enable indexing (Data Citation Index, crawled by Google)
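Data citation, one of the add-ons above, commonly follows the DataCite-recommended form "Creator (PublicationYear): Title. Publisher. Identifier". A small sketch that assembles such a string (the example values are invented):

```python
# Assemble a DataCite-style data citation string from its components.

def format_data_citation(creators, year, title, publisher, doi, version=None):
    """Return a citation of the form
    'Creator (Year): Title. [Version V.] Publisher. https://doi.org/...'"""
    parts = [f"{'; '.join(creators)} ({year}): {title}."]
    if version:
        parts.append(f"Version {version}.")
    parts.append(f"{publisher}.")
    parts.append(f"https://doi.org/{doi}")
    return " ".join(parts)

citation = format_data_citation(
    ["Miller, A."], 2015, "Ocean Temperature Series",
    "Example Data Repository", "10.1234/example",
)
print(citation)
# → Miller, A. (2015): Ocean Temperature Series. Example Data Repository. https://doi.org/10.1234/example
```

Because the identifier resolves via DOI, the citation stays valid even if the repository's landing-page URLs change.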
Trusted data publishing contains:

• Standardized information about the data
  – Disciplinary standards
  – Basic common metadata sets
• Distinct roles, workflows and responsibilities
  – Authorship, submission
  – Curation
  – Quality assurance
  – Peer review
• Persistent identification
  – Permanent reference
  – Data citation
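A "basic common metadata set" can be checked mechanically. A sketch that validates a record against the DataCite mandatory properties (Identifier, Creator, Title, Publisher, PublicationYear, ResourceType); the flat-dict record layout is an assumption for this sketch, not the DataCite schema itself:

```python
# Check a metadata record for the DataCite mandatory properties.
# The flat-dict record layout is an assumption for illustration; real
# records would follow the DataCite XML or JSON schema.
MANDATORY = ("identifier", "creator", "title",
             "publisher", "publicationYear", "resourceType")

def missing_mandatory(record):
    """Return the mandatory DataCite properties absent from `record`."""
    return [p for p in MANDATORY if not record.get(p)]

record = {
    "identifier": "10.1234/example",
    "creator": "Miller, A.",
    "title": "Ocean Temperature Series",
    "publisher": "Example Data Repository",
    "publicationYear": 2015,
}
print(missing_mandatory(record))  # → ['resourceType']
```

A curation workflow can run such a check at ingestion and bounce incomplete deposits back to the submitter.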
Challenges
• Interoperability challenges
  – Different metadata schemas
  – Rich vs. limited metadata
• Discoverability challenges
  – E.g. no bi-directional linking
  – Usability challenges in aggregators
• Metrics and accreditation
• What information is needed for future reuse/remix/reproducibility?
• How can this information be exposed, human- and machine-readable?
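The bi-directional linking gap noted above is the kind of problem the Scholix framework (an RDA/WDS output) later addressed with exchangeable article–data link records. A sketch of such a record, with field names based on the published Scholix schema and all values invented:

```python
# Illustrative Scholix-style link between an article and a dataset.
# All identifiers below are invented placeholders.
link = {
    "LinkProvider": [{"Name": "Example Data Repository"}],
    "RelationshipType": {"Name": "IsSupplementedBy"},
    "Source": {
        "Identifier": {"ID": "10.1234/article", "IDScheme": "doi"},
        "Type": {"Name": "literature"},
    },
    "Target": {
        "Identifier": {"ID": "10.1234/dataset", "IDScheme": "doi"},
        "Type": {"Name": "dataset"},
    },
}
# A consumer can traverse the link in either direction, which is exactly
# what ad-hoc one-way links between journals and repositories lack.
print(link["RelationshipType"]["Name"])  # → IsSupplementedBy
```

Aggregators that collect such records can answer both "which datasets underpin this article?" and "which articles use this dataset?".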
Thank you!
Data Publishing Workflows
Activities and processes in a digital environment that lead to the publication of research data and other research objects on the Web. These activities may be performed by humans or in an automated fashion. In contrast to the interim or final published products, workflows are the means to curate, document, peer review and thus ensure and enhance the value of the published product.