FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks and Off the Shelf...

Post on 12-Jan-2017


FAIR Software (and Data) Citation: Europe, Research Object Systems, Networks and Off the Shelf Infrastructure

Professor Carole Goble
The University of Manchester, UK
Software Sustainability Institute UK
ELIXIR-UK, ELIXIR Interop Platform
carole.goble@manchester.ac.uk
ORCID 0000-0003-1219-2137

NSF Workshop Data and Software Citation, 6-7 June 2016, Boston USA, Software Panel

Acknowledgements

U Manchester
• Stian Soiland-Reyes
• Stuart Owen
• Caroline Jay
• Robert Haines
• Norman Morrison
U Newcastle
• Paolo Missier
U Illinois Urbana-Champaign
• Dan Katz
Murphy Mitchell Consulting Ltd
• Fiona Murphy
F1000
• Liz Allen
U Oxford
• Neil Jefferies
• Lucie Burgess
ISI, USC
• Yolanda Gil
• Daniel Garijo
Force11 DCIP / Harvard
• Tim Clark
ELIXIR / BioSchemas.org
• Rafael Jimenez (Hub)
• Niall Beard (ELIXIR UK)
• Aleks Nenadic (ELIXIR UK)
• Jo McEntyre (EBI, THOR)
NIH BD2K
• Susanna Sansone (bioCADDIE, ELIXIR)
• Ian Fore (NIH)
Software Sustainability Institute
• Shoaib Sufi
• Neil Chue Hong
• Mike Jackson
STFC
• Catherine Jones

Chief Contexts

Workflow Repository

Systems and Synthetic Biology Projects

FAIR

Findable

Accessible

Interoperable

Reusable

Intelligible

Reproducible

Citable

Track & Countable

Findable

Accessible

Interoperable

Reusable

Intelligible

Change

Citable

Track & Countable

FAIR Credit

sciencecodemanifesto.org

http://www.elixir-europe.org/

17 ELIXIR members, 2 observers

major bioinformatics service providers (~150)

Co-operation Long term support


Data Citation in Europe PMC full text

http://dliservice.research-infrastructures.eu/#/

https://www.openaire.eu/

https://www.rd-alliance.org/groups/rdawds-publishing-data-services-wg.html

European Open Science Cloud

Technical and Human infrastructure for Open Research

• interoperability and integration between ORCID and DataCite infrastructures

• PID e-infrastructure: promote uptake and sustain

https://project-thor.eu/

Giving Researchers Credit for their Data

https://www.jisc.ac.uk/rd/projects/research-data-spring

• Carrots for authors, “pain-free” submission
• Helper app for submitting data papers and data for papers (using DataCite and ORCID)

http://www.software.ac.uk/software-credit

Over 90 Guides
War Stories
Policy, Support
http://www.software.ac.uk/software-management-plans

Digital Curation Centre
http://dcc.ac.uk

http://openresearchsoftware.metajnl.com/

http://www.software.ac.uk/how-cite-and-describe-software

Mike Jackson

http://rse.ac.uk

Not all creditable software is a “downloadable application”

Registration is hit and miss

Metrics Indicators

Counts

Community Smarts

Software Citation Space

Science as a Service

Open Source Codes

Virtual Machines

Portable Packaging

Libraries

Applications

Scripting environments

Infrastructure

Commercial tools

Scripts / Workflows

Packages GEMS

Dynamic Deployments

Reproducible Research: Citing your execution environment using Docker and a DOI

http://www.software.ac.uk/blog/2016-03-29-reproducible-research-citing-your-execution-environment-using-docker-and-doi

Caroline Jay, Robert Haines
http://idinteraction.cs.manchester.ac.uk
‘ABC: Using Object Tracking to Automate Behavioural Coding.’ CHI 2016.

= Fixity Publishing

Service vs Science
Background vs Foreground Software

Software and data* in the foreground are the most likely to be cited. The same software and data viewed as background are not cited, or not cited explicitly, though they are equally essential.

* Wynholds, et al (2012) Data, data use, and scientific inquiry: two case studies of data practices, doi:10.1145/2232817.2232822

The invisibility of software, esp:
• widely used
• infrastructural
• component/library
• cross-discipline

Credit Drift

[Figure: credit for “Foreground” and “Background” software across the immediate team and the background team. Labels: Authorship; Authorship?; Cited; Cited?; Acknowledged; Mentioned; Ignored]

Transitive, Fractional Credit
Not all software is equal

* Wynholds, et al (2012) Data, data use, and scientific inquiry: two case studies of data practices 10.1145/2232817.2232822

https://mr-c.github.io/shouldacite
http://bit.ly/shouldacite

SSI Collaborations Workshop 2016

Should I cite the software?

Overcoming Barriers to Software Citation
survey of experiences citing software in research publications

http://bit.ly/1WxWFY7

Caroline Jay, Robert Haines, University of Manchester, UK
Robin Wilson, University of Southampton, UK

Systems Biology Projects Commons

http://fair-dom.org

Systems and Synthetic Biology Projects
Linking, “Packaging” & Citing Codes, Data, Models, SOPs, Samples, Strains, Articles, People, Projects….

Repository spanning catalogue, reference (“cite”) distributed 3rd party content

Standards

Public data archives

Project data repositories

Literature archives

Public model archives

Uploaded content

Plugin Model tools

FAIRDOM

Plugin Data tools

Structured Metadata Capture

metadata sheets, sample sheets, data sheets

http://www.rightfield.org.uk

[Martin Scharm, Rostock University]

Haus et al, BMC Systems Biology, 2011, 5:10
Solvent production by Clostridium acetobutylicum

https://dx.doi.org/10.1111/febs.13237

https://doi.org/10.15490/seek.1.investigation.56

http://data.datacite.org/10.15490/seek.1.investigation.56

Citation G. Penkler; F. du Toit; W. Adams; M. Rautenbach; D. C. Palm; D. D. van Niekerk; J. L. Snoep; (2014): Glucose metabolism in Plasmodium falciparum trophozoites; FAIRDOMHub. http://dx.doi.org/10.15490/seek.1.investigation.56
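
The citation above follows a "creators (year): title; publisher. DOI link" pattern. As a minimal sketch of how such a string could be assembled from registered metadata, the snippet below uses a hypothetical `DatasetRecord` type; its field names are illustrative assumptions and are not taken from the DataCite schema or from FAIRDOMHub's own code. The values are those shown on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DatasetRecord:
    """Hypothetical minimal record; field names are illustrative only."""
    creators: tuple[str, ...]
    year: int
    title: str
    publisher: str
    doi: str


def format_citation(record: DatasetRecord) -> str:
    """Render 'Creators (Year): Title; Publisher. DOI link'."""
    creators = "; ".join(record.creators)
    return (
        f"{creators} ({record.year}): {record.title}; "
        f"{record.publisher}. https://doi.org/{record.doi}"
    )


# Values copied from the citation on the slide.
record = DatasetRecord(
    creators=(
        "G. Penkler", "F. du Toit", "W. Adams", "M. Rautenbach",
        "D. C. Palm", "D. D. van Niekerk", "J. L. Snoep",
    ),
    year=2014,
    title="Glucose metabolism in Plasmodium falciparum trophozoites",
    publisher="FAIRDOMHub",
    doi="10.15490/seek.1.investigation.56",
)
print(format_citation(record))
```

Keeping the DOI as a bare identifier and adding the resolver prefix only at render time means the same record can be printed with whichever resolver URL a venue prefers.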

Fixity Publishing, URIs -> DOIs

"Mapping present and future predicted distribution patterns for a meso-grazer guild in the Baltic Sea" Sonja Leidenberger et al

Credits
Attributions

In Multiple Packs

Track?

Workflows

Pointer to 3rd Party Data Collection

Pointer to 3rd Party Code

Local files

• Aggregated
• Granularity
• Atomicity / Subsets
• Recombined
• Distributed
• Dynamic and versioned

• Multi-contributors
• Spans resources
• Independently stewarded
• Shift and change

Content Contribution

• Metadata Framework: bundle and relate the multi-hosted, scattered digital resources of a scientific experiment or investigation using standard mechanisms

• Exchange, Publishing, Reproducibility, Portability, Repair

See Stephen Abrams's talk yesterday

Datasets, Data collections
Standard operating procedures
Software, algorithms
Configurations, Tools and apps, services

Slideshare
Github
figshare
Community DB
Arxiv.org
Pubmed
Docker image

Codes, code libraries
Workflows, scripts
System software
Infrastructure
Compilers, hardware

Input Data
Workflow Description
Provenance trace
Version of Codes / Services
Output
Manifest Construction
Manifest

Identification to locate things
Aggregates to link things together
Annotations about things & their relationships

Container

Metadata Objects Citable Reproducible Packaging

Manifest Description
Type Checklists: what should be there
Provenance: where it came from
Versioning: its evolution
Dependencies: what else is needed

Manifest
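
The four parts of a manifest description listed above (type checklist, provenance, versioning, dependencies) lend themselves to a simple completeness check. The sketch below is an illustration only: the key names in `REQUIRED_SECTIONS` and the example manifest are hypothetical and do not come from the Research Object specifications.

```python
# Hypothetical section names, mirroring the four bullets on the slide.
REQUIRED_SECTIONS = ("type", "provenance", "version", "dependencies")


def missing_sections(manifest: dict) -> list[str]:
    """Return the required sections that are absent or empty in a manifest."""
    return [key for key in REQUIRED_SECTIONS if not manifest.get(key)]


# Illustrative manifest for a workflow-style research object.
complete = {
    "type": "workflow",
    "provenance": {"derived_from": "raw-measurements.csv"},
    "version": "1.2.0",
    "dependencies": ["simulation-library"],
}
incomplete = {"type": "workflow"}

print(missing_sections(complete))    # nothing missing
print(missing_sections(incomplete))  # provenance, version, dependencies
```

A "minimal" profile would correspond to a short required list, and a "maximal" or extended profile to a longer one.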

Packaging content & links: Zip files, BagIt, Docker images

Catalogues & Commons Platforms: FAIRDOM SEEK, STELAR eLab

OAI ORE

W3C OADM

RO Types: Manifest Content Profiles
minimal, maximal, extensible

PID
Citation
Checklist
Version
Provenance
Dependencies

JATS
Comms
DC DCAT
Exp
ISA
EFO
Domain
SBML
MIAME
CWL

Common properties among content types

Minimum information for one content type

Workflow RO Bundle
ZIP or BagIt folder structure

Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics doi:10.1016/j.websem.2015.01.003

application/vnd.wf4ever.robundle+zip

JSON and YAML
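
As a rough sketch of the ZIP packaging mentioned above, the snippet below writes a bundle with a `mimetype` entry carrying the media type from the slide, a JSON manifest, and the aggregated files. The manifest location (`.ro/manifest.json`), its keys (`id`, `aggregates`) and the context URL are written from memory of the RO Bundle specification and should be treated as assumptions to check against that specification; `build_bundle` is a hypothetical helper, not part of any Research Object library.

```python
import io
import json
import zipfile

# Media type as given on the slide.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_bundle(resources: dict[str, bytes]) -> bytes:
    """Pack resources plus a manifest into an in-memory ZIP archive."""
    manifest = {
        # Assumed context URL and key names; verify against the spec.
        "@context": ["https://w3id.org/bundle/context"],
        "id": "/",
        "aggregates": [{"uri": "/" + path} for path in sorted(resources)],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # Write the media type first and uncompressed so tools can sniff it.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for path, payload in sorted(resources.items()):
            bundle.writestr(path, payload)
    return buffer.getvalue()


archive = build_bundle({"workflow.cwl": b"cwlVersion: v1.0\n"})
with zipfile.ZipFile(io.BytesIO(archive)) as bundle:
    print(bundle.namelist())
```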

Persistent Identification of Software: a building block to citation & curation

Catherine.jones@stfc.ac.uk
B. Matthews, I. Gent, J. Tedds & S Lamerton
Project URL http://rrr.cs.st-andrews.ac.uk/

Guidelines for persistently identifying software using DataCite

https://epubs.stfc.ac.uk/work/24058274

• Most recent?
  – Location indicator, crosslink
  – Credit the contributors now, the version now
  – Strong presumption it exists and is living

• Fixed Snapshot?
  – Defend publication, Reuse
  – Credit the contributors then, the version then
  – Presumption it exists and is archived

• Line in the sand?
  – Credit the contributors then, the version then
  – Weak presumption it exists

• Warrant?
  – Acknowledgement not contribution
  – Don’t care if it exists
  – Important “influence” citation for its contributors

What does the citation mean for the author or reader?

Identifier Resolution, Citation Persistence, Content Decay?

Commons

my Disk

Commons

• DOI proliferation
  – Channelling for Counting and Landing Pages

• Authenticity: Tamper-proof Exchange and Provenance
  – Hashing & Checksums
  – Secure signature & probity services
  – Block chain
    • anti-tampering transaction logging
    • https://www.ethereum.org/
  – Proll and Rauber, Scalable data citation in dynamic, large databases: Model and reference implementation, (2014) 10.1109/BigData.2013.6691588
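
The "Hashing & Checksums" point above can be illustrated with a small fixity check: record a digest when content is deposited, then recompute it on retrieval. This is a minimal sketch using Python's standard library; the function names are hypothetical, and the choice of SHA-256 is an assumption, since the slide does not name an algorithm.

```python
import hashlib
import hmac


def sha256_hex(payload: bytes) -> str:
    """Return the SHA-256 digest of the payload as a hex string."""
    return hashlib.sha256(payload).hexdigest()


def is_untampered(payload: bytes, recorded_checksum: str) -> bool:
    """Compare a freshly computed digest against the recorded one."""
    # compare_digest performs a constant-time comparison.
    return hmac.compare_digest(sha256_hex(payload), recorded_checksum)


# Checksum recorded at deposit time (illustrative content).
recorded = sha256_hex(b"model v1")

print(is_untampered(b"model v1", recorded))  # content unchanged
print(is_untampered(b"model v2", recorded))  # content altered
```

A checksum only detects change. Showing who made the deposit needs the signature or transaction-logging services listed in the same bullet.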

• Uber Collection / Hierarchy / subsetting (cf. Dryad, DataONE, DataVerse)*

• RO author/contributor information in its manifest

• ROs manifest => constituent resources, provenance for contribution.

*Ball, A. & Duke, M. (2011). "How to Cite Datasets and Link to Publications?". DCC How-to Guides. Edinburgh: Digital Curation Centre. http://www.dcc.ac.uk/resources/how-guides/cite-datasets.

Granularity
Atomicity
Aggregation

Robust Transitivity & Propagation
Citation and Credit Aggregation and Granularity

• Backward Citation
  – What was this based on, who did it?
• Forward Citation
  – What is using this, who did that?
• “PageRank”
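
Backward and forward citation are two views of the same link data. As a small sketch, if each product records what it was based on (backward), the "what is using this" view (forward) can be derived by inverting those links. The product names and the `forward_index` helper below are hypothetical.

```python
from collections import defaultdict


def forward_index(based_on: dict[str, list[str]]) -> dict[str, list[str]]:
    """Invert backward citations (product -> sources) into forward ones."""
    used_by: dict[str, list[str]] = defaultdict(list)
    for product, sources in based_on.items():
        for source in sources:
            used_by[source].append(product)
    return {source: sorted(users) for source, users in used_by.items()}


# Backward citations: what each product was based on (illustrative names).
based_on = {
    "paper": ["workflow", "dataset"],
    "workflow": ["library"],
}

# Forward citations: what is using each product.
print(forward_index(based_on))
```

Ranking schemes in the spirit of “PageRank” would operate on this same graph, but are not attempted here.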

Credit Aggregation

Citation Granularity
Drift

D. S. Katz, "Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products," Journal of Open Research Software, v.2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be


Who gets credit for what?

Using Provenance for Credit Mapping

Paolo Missier

Alice

Charlie

Bob

Paolo Missier, Data Trajectories: tracking reuse of published data for transitive credit attribution, IDCC 2016

W3C PROV dependency graph

“Provlets”

• Tracking RO usage and indirect contributions

• Awarding fractional credit to contributors

1. “Contriponents”
   • contributors + components
2. Weighted contribution
3. Networked Credit maps
   • Travel with the contriponents

Transitive Credit contribution
Dan Katz and Arfon Smith

*Katz, D.S. & Smith, A.M., (2015). Transitive Credit and JSON-LD. Journal of Open Research Software. 3(1), p.e7, DOI: http://doi.org/10.5334/jors.by

D. S. Katz, "Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products," Journal of Open Research Software, v.2(1): e20, pp. 1-4, 2014. DOI: 10.5334/jors.be
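
One way to read the transitive credit idea is that each product carries a credit map whose weights sum to 1, with entries pointing either at people or at other products, and that credit for a product flows through to the contributors of the products it depends on. The sketch below illustrates that reading only: the weights are invented, the data layout is an assumption rather than the JSON-LD format of the cited paper, and the dependency graph is assumed to contain no cycles.

```python
def transitive_credit(product: str, credit_maps: dict[str, dict[str, float]]) -> dict[str, float]:
    """Resolve a product's credit map down to people only.

    Entries naming another product are expanded recursively, scaled by the
    weight given to that product. Assumes the graph has no cycles.
    """
    totals: dict[str, float] = {}
    for name, weight in credit_maps[product].items():
        if name in credit_maps:  # the entry is itself a product
            for person, share in transitive_credit(name, credit_maps).items():
                totals[person] = totals.get(person, 0.0) + weight * share
        else:  # the entry is a person
            totals[name] = totals.get(name, 0.0) + weight
    return totals


# Invented weights; each map sums to 1.
credit_maps = {
    "paper": {"Alice": 0.5, "Bob": 0.25, "library": 0.25},
    "library": {"Charlie": 0.5, "Alice": 0.5},
}

print(transitive_credit("paper", credit_maps))
```

Here Alice receives her direct share of the paper plus half of the quarter assigned to the library, so the resolved shares still sum to 1.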

How do we weight and track?

https://www.refme.com/uk/

http://depsy.org/

• Literature mining
  – Duck et al, Ambiguity and variability of database and software names in bioinformatics (2015) DOI: 10.1186/s13326-015-0026-0

• Infrastructure
  – Identifier and provenance infrastructure, dependency managers, metrics services, repositories, machine readable and processable metadata, reference managers

• CRediT – contributor taxonomy
  – http://casrai.org/CRediT
  – Time for revision?

http://mdc.lagotto.io/

http://ivory.idyll.org/blog/2015-authorship-on-software-papers.html

Find | Cite | Credit
Ramps “Riding the metadata COTS-tails”

• A third of web pages
• Opening out -> community groups and extensions
• Builds on a shared core and data structure
• Simple embedding in web pages and CMS
• Widespread tooling, harvesters and indexing
• Search engines and Integration tools
• It’s all about the metadata and knowledge graph

Google, Bing, Yahoo, Yandex

Find | Cite | Credit
Ramps “Riding the metadata COTS-tails”

Depth

DATS

Reach

http://codemeta.github.io/

http://ontosoft.org/

Find | Cite | Credit Ramps “Riding the metadata COTS-tails”

Reach

Depth

Bioschemas.org

Specification

Data model

Minimum information

Controlled vocabularies

Cardinality

Documentation

Examples

New (properties | types)

Restrictions

Constraints

Extensions

BioSchemas.org
minimal, maximal, extensible

Training materials

Events Organizations

Data

Standards

Software

Minimum information

for one content type

Training materials

Events Organizations

Data

Software

Standards

Common properties

among content types

Identifier, Title, Description, Author, Topics, Audience, Publication Date, …

Schema.org
BioSchemas.org, W3C FHIR WG

Daniel Mietchen et al, Adapting JATS to support data citation, Journal Article Tag Suite Conference (JATS-Con) Proceedings 2015, Bethesda (MD): National Center for Biotechnology Information 2015.

Journal Article Tag Suite

DATS

SoftwareSourceCode
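
To make the "simple embedding in web pages" idea concrete for software, the sketch below builds a schema.org `SoftwareSourceCode` description as JSON-LD and prints it inside the script element a web page would carry. The property names used (`name`, `codeRepository`, `version`, `author`) are schema.org terms; the tool name, repository URL and author are placeholders, and `software_jsonld` is a hypothetical helper rather than part of any Bioschemas tooling.

```python
import json


def software_jsonld(name: str, repository: str, version: str, authors: list[str]) -> dict:
    """Build a schema.org SoftwareSourceCode description as a JSON-LD dict."""
    return {
        "@context": "https://schema.org",
        "@type": "SoftwareSourceCode",
        "name": name,
        "codeRepository": repository,
        "version": version,
        "author": [{"@type": "Person", "name": author} for author in authors],
    }


# Placeholder values for illustration only.
markup = software_jsonld(
    name="example-tool",
    repository="https://example.org/example-tool.git",
    version="0.1.0",
    authors=["A. Researcher"],
)

print('<script type="application/ld+json">')
print(json.dumps(markup, indent=2))
print("</script>")
```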

• Stretch in all directions
  – Granularity, Atomicity, Aggregation
  – Only partially automatable

• Dynamic Citation
  – “Citable Units”
  – Buneman et al, https://tinyurl.com/bdf-cacm

• ROs & Contriponents
  – Standardised metadata manifests
  – Tracking fabrics
  – Distributed => will break

• Keep it simple
  – Incremental, Commodity based, Low Tech
  – Guidelines & Conventions
  – Ramps – like Bioschemas.org
  – Capture metadata all along the way….

Open Questions?

Getting folks (authors, reviewers, editors) to cite software and data

For Further Information
• http://www.researchobject.org
• http://www.wf4ever-project.org
• http://www.fair-dom.org
• http://seek4science.org
• http://www.software.ac.uk
• http://www.bioschemas.org
• http://codemeta.github.io/
• http://myexperiment.org
• http://www.commonwl.org/

EXTRAS

unshown