Goble keynote vivo-scits2014

Research Objects for FAIRer Science Professor Carole Goble CBE FREng FBCS The University of Manchester, UK [email protected] VIVO/SciTS Conferences 6-8 August 2014, Austin, TX

description

Research Objects for FAIRer Science - Shared Keynote presentation at VIVO and Science of Team Science Joint Conference, 6-8 August 2014, Austin Texas

Transcript of Goble keynote vivo-scits2014

Page 1: Goble keynote vivo-scits2014

Research Objects for FAIRer Science

Professor Carole Goble CBE FREng FBCS

The University of Manchester, UK

[email protected]

VIVO/SciTS Conferences 6-8 August 2014, Austin, TX

Page 2: Goble keynote vivo-scits2014

Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct

…..papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension

Jill Mesirov, Accessible Reproducible Research, Science 22 Jan 2010: 327(5964): 415-416, DOI: 10.1126/science.1179653

Virtual Witnessing*

*Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985) Shapin and Schaffer.

Page 3: Goble keynote vivo-scits2014

Virtual Witnessing*

*Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985) Shapin and Schaffer.

Capturing, representing, sharing the information needed to understand how a research result came about.

Context of results
• Inputs, outputs, process…

Context of resources
• Instruments, data, software, people…

Page 4: Goble keynote vivo-scits2014

“An article about computational science in a scientific publication is not the scholarship itself, it is merely advertising of the scholarship. The actual scholarship is the complete software development environment, [the complete data] and the complete set of instructions which generated the figures.” David Donoho, “Wavelab and Reproducible Research,” 1995

datasets, data collections, standard operating procedures, software, algorithms, configurations, tools and apps, codes, workflows, scripts, code libraries, services, system software infrastructure, compilers, hardware

Morin et al, Shining Light into Black Boxes, Science 13 April 2012: 336(6078) 159-160

Ince et al The case for open computer programs, Nature 482, 2012

Page 5: Goble keynote vivo-scits2014

“I can’t immediately reproduce the research in my own laboratory. It took an estimated 280 hours for an average user to approximately reproduce the paper.”

Phil Bourne, NIH Big Wig for Data Science

Page 6: Goble keynote vivo-scits2014

a reproducibility paradox

big, fast, complicated, multi-step, multi-type, multi-field

greater expectations of reproducibility

diy publishing, greater access

Page 7: Goble keynote vivo-scits2014

Systems Biology Collaborations

Modelling Cycle

45 organisations 112 organisations

Page 8: Goble keynote vivo-scits2014

Data

Models

Articles

External Databases

http://www.seek4science.org

Metadata

http://www.isatools.org

Ontology-driven Aggregated Content Infrastructure (Framework) for building Sys Bio Commons

Sharing and interlinking multi-stewarded, mixed methods, models, data, samples…

Standards

DCAT, FOAF

Yellow Pages

Page 9: Goble keynote vivo-scits2014

Yellow Pages

Careful Sharing Options

Page 10: Goble keynote vivo-scits2014

Commons

Page 11: Goble keynote vivo-scits2014

Investigations

Assays

Studies

Towards Interoperable Bioscience Data, Nature Genetics, 2012

Standards, Structure, Interlink

Just Enough Results Model for things produced and used in experiments

Page 12: Goble keynote vivo-scits2014

Construction data

Validation data

Metabolomics

Mass Spec

Transcriptomics

Proteomics

Fluxomics

Publications

Mix of locally & remotely hosted content

Open Modelling Exchange Format Archive

Wolstencroft et al, Proc ISWC 2013

Just Enough Results Model for stuff in experiments

Common elements

Data type specific elements

Page 13: Goble keynote vivo-scits2014

Experimentalists, modellers & developers

Cross-site, cross project collaboration

Knowledge network

Building the System: Building a Cult

TRUST

VISION

SETTING EXPECTATIONS

Drink together

Work together

Page 14: Goble keynote vivo-scits2014

• Collaboration – Complementarity correlation

• Modellers share more than Experimentalists

• Experimentalists reuse models more than Modellers

• Active enclave sharing

• Public sharing tricky even after publication, bribery and threats

• Data Hugging, Flirting and Voyeurism

Page 15: Goble keynote vivo-scits2014

• Playground rules apply

• Fluid, transient collaborations > membership mgt pain in a*se

• Shameless exploitation of PI competitiveness & vanity

• PI & Funder leadership

• Pan project spawned collaborations – YES!!!!

• But not necessarily visible to us.

Page 16: Goble keynote vivo-scits2014

Data discovery

Data assembly, cleaning, and refinement

Ecological Niche Modeling

Statistical analysis

Data collection

Insights

Scholarly Communication & Reporting

Enclosed sea problem (Ready et al., 2010)

Pilumnus hirtellus

Scientific Workflows

Page 17: Goble keynote vivo-scits2014

BioSTIF

method

instruments and laboratory

materials

Data discovery

Data assembly, cleaning, and refinement

Ecological Niche Modeling

Statistical analysis

Data collection

Insights

Scholarly Communication & Reporting

Method Matters!

Page 18: Goble keynote vivo-scits2014

Workflow Commons

Page 19: Goble keynote vivo-scits2014

"Mapping present and future predicted distribution patterns for a meso-grazer guild in the Baltic Sea" by Sonja Leidenberger et al

Page 20: Goble keynote vivo-scits2014

1st International Workshop on Social Object Networks (SocialObjects 2011), Boston, October 9th 2011.

Find, Click ‘n’ Go

File ‘n’ Forget

Specialist Curators

Page 21: Goble keynote vivo-scits2014


Properties: What would you ask a publication if you could?

Identity and Description, Uniqueness, Authenticity: Who are you? Where and when were you born? Who were your parents (creators)?

Review, Reuse, and Repurpose: For which purpose were you conceived and have you been used?

Inspection, Visualization, Annotations: What do you have inside?

Representation: How is your content structured?

Access Rights: May I access all your parts?

Adaptability: Which parts can I replace?

Evolution & Versioning, Provenance: What have they done to you? Who and when? Why did they do that?

Quality: Why are you relevant to me? Can I believe what you are saying or trust your results?

Reproducibility: Do you still produce the same results?

Fitness: Are you still working? How could I repair you?

Credit and attribution: How could I thank you? How could I talk about you?

Page 22: Goble keynote vivo-scits2014

From Manuscripts

to “Research Objects”

A meme

The multi-dimensional paper

Packs

Page 23: Goble keynote vivo-scits2014

Packs

www.datafairport.org

Page 24: Goble keynote vivo-scits2014

What is a Research Object?

Page 25: Goble keynote vivo-scits2014

Howard Ratner, STM Innovations Seminar 2012

was: Chair STM Future Labs Committee, CEO EVP Nature Publishing Group,

now: Director of Development for CHORUS (Clearinghouse for the Open Research of US)

http://www.youtube.com/watch?v=p-W4iLjLTrQ&list=PLC44A300051D052E5

http://www.myexperiment.org/packs/196.html

Page 26: Goble keynote vivo-scits2014
Page 27: Goble keynote vivo-scits2014
Page 28: Goble keynote vivo-scits2014

What The Commons* Is and Is Not

Is Not:
– A database
– Confined to one physical location
– A new large infrastructure
– Owned by any one group

Is:
– A conceptual framework
– Analogous to the Internet
– A collaboratory
– A few shared rules
• All research objects have unique identifiers
• All research objects have limited provenance

Philip E. Bourne Ph.D., Associate Director for Data Science, National Institutes of Health

http://www.slideshare.net/pebourne

*The NIH BD2K Commons Framework, $100 million in 2015

Page 29: Goble keynote vivo-scits2014

Social Objects

carriers of discourse

Page 30: Goble keynote vivo-scits2014

http://www.researchobject.org/

A Framework to Bundle and Relate multi-hosted (digital) resources of a scientific experiment or investigation using standard mechanisms & uniform access protocols. Carriers of Research Context

Outputs are first class citizens to be managed, credited and tracked: data, software

Research Objects

Page 31: Goble keynote vivo-scits2014

Links

• Recording & linking together the components of an experiment

• Linking across experiments.

Page 32: Goble keynote vivo-scits2014

Preserve, Archive

Reproduce*, Recompute, Reuse, Train & Explain

Exchange, Remix, Fix

* a word that means many things…..

Page 33: Goble keynote vivo-scits2014

re-compute

replicate

rerun

repeat

re-examine

repurpose

recreate

reuse

restore

reconstruct

review

regenerate

revise

recycle

regenerate the figure

redo

Results may vary

Page 34: Goble keynote vivo-scits2014

repeat replicate

Drummond C, Replicability is not Reproducibility: Nor is it Good Science, online

Peng RD, Reproducible Research in Computational Science, Science 2 Dec 2011: 1226-1227.

Methods (techniques, algorithms, spec. of the steps)

Materials (datasets, parameters, algorithm seeds)

Experiment

Instruments (codes, services, scripts, underlying libraries)

Laboratory (sw and hw infrastructure, systems software, integrative platforms)

Setup

reuse, reproduce

Executable Research Object

Page 35: Goble keynote vivo-scits2014

same experiment, same set up, same lab

same experiment, same set up, different lab

same experiment, different set up

different experiment, some of same

Validate

reuse, reproduce

repeat, replicate

http://www.biomedcentral.com/biome/carole-goble-on-reproducible-research-what-it-really-means-how-to-reach-it/
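
The four conditions on this slide can be read as a small decision rule. The sketch below is one plausible reading, not the speaker's definition: it assumes the labels pair with the conditions in order (repeat, replicate, reproduce, reuse). The `Attempt` class and `classify` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    """How a follow-up study relates to the original (illustrative only)."""
    same_experiment: bool
    same_setup: bool
    same_lab: bool


def classify(attempt: Attempt) -> str:
    # Assumed pairing of slide labels to conditions; the transcript lists
    # the labels separately, so this mapping is an interpretation.
    if not attempt.same_experiment:
        return "reuse"        # different experiment, some of same
    if not attempt.same_setup:
        return "reproduce"    # same experiment, different set up
    if attempt.same_lab:
        return "repeat"       # same experiment, same set up, same lab
    return "replicate"        # same experiment, same set up, different lab


own_rerun = classify(Attempt(True, True, True))
independent_rerun = classify(Attempt(True, True, False))
```

Under this reading, a lab rerunning its own workflow is a "repeat", while another lab running the same workflow on the same set up is a "replicate".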

Page 36: Goble keynote vivo-scits2014

Design

Execution

Result Analysis

Collection

Publish / Report

Peer Review

Peer Reuse

Modelling

Can I repeat & defend my method?

Can I review / reproduce and compare my results / method with your results / method?

Can I review / replicate and certify your method?

Can I transfer your results into my research and reuse this method?

* Adapted from Mesirov, J. Accessible Reproducible Research Science 327(5964), 415-416 (2010)

Research Report

Prediction

Monitoring

Cleaning

Page 37: Goble keynote vivo-scits2014

specialist codes libraries, platforms, tools

services

(cloud) hosted services

commodity platforms

data collections, catalogues, software repositories

my data, my process, my codes

integrative frameworks

gateways

Page 38: Goble keynote vivo-scits2014

data carpentry

http://software-carpentry.org/

Page 39: Goble keynote vivo-scits2014

Components & Dependencies

• 35 kinds of annotations
• 5 Main Workflows
• 14 Nested Workflows
• 25 Scripts
• 11 Configuration files
• 10 Software dependencies
• 1 Web Service
• Dataset: 90 galaxies observed in 3 bands
• Multiple platforms
• Multiple systems

José Enrique Ruiz (IAA-CSIC)

Galaxy Luminosity Profiling

Page 40: Goble keynote vivo-scits2014

Executable Instrument Entropy

Zhao, Gomez-Perez, Belhajjame, Klyne, Garcia-Cuesta, Garrido, Hettne, Roos, De Roure and Goble. Why workflows break - Understanding and combating decay in Taverna workflows, 8th Intl Conf e-Science 2012

Mitigate

Detect, Repair

Preserve

Partial replication, Approx. reproduction, Verification, Benchmarks

Page 41: Goble keynote vivo-scits2014

Executable Instrument Entropy

Prepare to Repair

Reproducibility by Inspection: Read It

Reproducibility by Invocation: Run It

Document Instrument

Page 42: Goble keynote vivo-scits2014

[Adapted Freire, 2013]

provenance, gather dependencies, capture steps, track & keep results

portability, variability tolerance, preservation, packaging, versioning

open, accessible, available, machine actionable

description, intelligible, machine-readable
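
To make "capture steps" and "track & keep results" concrete, here is a minimal sketch, assuming each analysis step is a plain Python function. `ProvenanceLog`, `fingerprint`, and the two example steps are hypothetical names invented for this illustration; this is not one of the tools named in the talk, and a real system would record far more about dependencies.

```python
import functools
import hashlib
import json
import platform
import sys
from datetime import datetime, timezone


def fingerprint(value) -> str:
    """Short SHA-256 digest of a JSON-serialisable value."""
    payload = json.dumps(value, sort_keys=True, default=repr).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


class ProvenanceLog:
    def __init__(self):
        self.records = []
        # "gather dependencies": only a coarse snapshot of the environment.
        self.environment = {
            "python": sys.version.split()[0],
            "platform": platform.platform(),
        }

    def step(self, func):
        """Decorator: record each call of func as one provenance entry."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            self.records.append({
                "step": func.__name__,                    # capture steps
                "inputs": fingerprint([args, kwargs]),
                "output": fingerprint(result),
                "result": result,                         # track & keep results
                "at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return wrapper


log = ProvenanceLog()


@log.step
def clean(values):
    return [v for v in values if v is not None]


@log.step
def mean(values):
    return sum(values) / len(values)


average = mean(clean([1.0, None, 3.0]))
```

After the run, `log.records` holds one entry per step in execution order, and `log.environment` gives a minimal description of where it ran.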

Page 43: Goble keynote vivo-scits2014

[Adapted Freire, 2013]

Authoring

Exec. Papers

Link docs to experiment

Sweave

Provenance Tracking, Versioning

Replay, Record, Repair

Workflows, makefiles

ProvStore

provenance, gather dependencies, capture steps, track & keep results

open, accessible, available, machine actionable

description, intelligible, machine-readable

Page 44: Goble keynote vivo-scits2014

[Adapted Freire, 2013]

packaging, portability, variability tolerance, preservation

provenance, gather dependencies, capture steps, track & keep results

versioning

host

service

Open Source/Store

Sci as a Service

Integrative fws

Virtual Machines

Recompute, limited installation, Black Box

Byte execution, copies

Descriptive read, White Box

Archived record

Read & Run, Co-location

No installation

Portable Package

White Box, Installation

Archived record

Page 45: Goble keynote vivo-scits2014

[Adapted Freire, 2013]

host

service

ReproZip

packaging, portability, variability tolerance, preservation

provenance, gather dependencies, capture steps, track & keep results

versioning

Page 46: Goble keynote vivo-scits2014

No Green Fields No One System

Find, Access, Interop, Reuse

Porting across Platforms

Exchange between Systems

Comparing across Labs

Page 47: Goble keynote vivo-scits2014

Identity

Description

Packaging

Refer to aggregations and their resource contents

Interpretation: What does it mean? How can I compare with others? How is it linked together and linked to others?

Describe aggregation structure and its constituent parts

Container regardless of host

FAIR RO Core Model

manifest

Uniform and first class handling of diverse types (data, software, workflows…)

Page 48: Goble keynote vivo-scits2014

Identity

Annotation

Aggregation

FAIR RO Core Model

DOIs, URIs, Handles

ORCID

W3C OAM

OAI-ORE

Open Annotation Model

OAI-Object Reuse and Exchange

Page 49: Goble keynote vivo-scits2014

Identity

Annotation

Aggregation

FAIR RO Core Model

DOIs, URIs, Handles

ORCID

Aggregations, Resource maps, Proxies

Annotation first class and stand-off

Identity persistence and resolution, Citation

W3C OAM

OAI-ORE
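
The three layers on this slide (identity, aggregation, stand-off annotation) can be sketched as plain data structures. This is an illustrative sketch only: it does not implement OAI-ORE or the W3C Open Annotation Model, and every class name, URI, and identifier below is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Annotation:
    """Stand-off: the description lives outside the resource it targets."""
    target: str   # URI of the thing described
    body: str     # text or URI of the description
    creator: str  # identifier of the annotator, e.g. an ORCID URI


@dataclass
class ResearchObject:
    identifier: str                                    # identity
    aggregates: List[str] = field(default_factory=list)        # aggregation
    annotations: List[Annotation] = field(default_factory=list)

    def aggregate(self, uri: str) -> None:
        # Resources are referenced by URI, so they may be hosted anywhere.
        if uri not in self.aggregates:
            self.aggregates.append(uri)

    def annotate(self, target: str, body: str, creator: str) -> None:
        if target != self.identifier and target not in self.aggregates:
            raise ValueError(f"{target} is not part of this research object")
        self.annotations.append(Annotation(target, body, creator))

    def about(self, target: str) -> List[Annotation]:
        return [a for a in self.annotations if a.target == target]


# All URIs below are placeholders, not real records.
ro = ResearchObject("https://example.org/ro/1")
ro.aggregate("https://example.org/data/galaxies.csv")
ro.aggregate("https://example.org/workflows/profile")
ro.annotate(
    "https://example.org/data/galaxies.csv",
    "Input dataset for the workflow",
    "https://orcid.org/0000-0000-0000-0000",
)
```

Because annotations are held separately from the aggregated resources, third parties can describe a resource they do not host or control.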

Page 50: Goble keynote vivo-scits2014

Identity

Annotation

Aggregation

FAIR RO Core Platforms

DOIs, URIs, Handles

ORCID

Data Citation Implementation

W3C OAM

OAI-ORE

Page 51: Goble keynote vivo-scits2014

Distributed Third Party Tenancy

Alien Store

Aggregation: Carrier of Research Context
• Identifiable, citable, resolvable
• Uniform Management
• Mixed Stewardship
• Decay & Graceful Degrade
• Content & Aggregation Lifecycles
• Annotations
• Manifests, Recipes, Permissions, Discourse

Aggregations
• Dispersed / Encapsulated
• External (linked) / Local
• Mixed types
• Blackboxes
• Virtual / Materialised

Content Resources
• Aggregations themselves
• In many aggregations
• Virtual / Materialised
• Open / Closed

Page 52: Goble keynote vivo-scits2014

TARDIS: Time and Relative Dimension in SpaceScience

Page 53: Goble keynote vivo-scits2014

RO Model Ontology

Page 54: Goble keynote vivo-scits2014

Progression Levels: Management and Interpretation for Integrated Applications

• RO Management
– Transportation / Access / Citation
– Id location of RO “container”
– Provenance of RO & contents
– Behaviour/lifecycle of RO & contents
– Policies

• RO Interpretation
– What the RO and its content mean
– How they can be compared and validated
– How they can be used, executed, linked

• Interpretation variations
– Type (e.g. Workflows)
– Discipline (e.g. Biology)
– Task (e.g. Discovery, Execution)
– Activity (e.g. Experiment)

Page 55: Goble keynote vivo-scits2014

Progression Levels: Management and Interpretation for Integrated Applications

• RO Management
– Transportation / Access / Citation
– Id location of RO “container”
– Provenance of RO & contents
– Behaviour/lifecycle of RO & contents
– Policies

• RO Interpretation
– What the RO and its content mean
– How they can be compared and validated
– How they can be used, executed, linked

• Interpretation variations
– Type (e.g. Workflows)
– Discipline (e.g. Biology)
– Task (e.g. Discovery, Execution)
– Activity (e.g. Experiment)

Page 56: Goble keynote vivo-scits2014

Checklists

Versioning

Provenance

Dependencies

More Stakeholders & Services

Citation minimum

More specialised detail

Fewer but more specialised stakeholders & services

Annotation Profiles

Depth: how deeply described

Coverage: how much is covered.

Progression levels

Semantic Framework

Page 57: Goble keynote vivo-scits2014

Checklists

Versioning

Provenance

Dependencies

NISO-JATS

EXPO, ISA, JERM, OBI

MIAME, SBML

GIT

MIM Ontology

PROV, PAV, VoID

Puppet, Docker

Make

PAV

RO Model: roevo, wfprov, wfdesc

SysBio Workflows

DCAT

Annotation Profiles

Depth: how deeply described

Coverage: how much is covered.

Progression levels

Semantic Framework

Experiment

VIVO-ISF

DC

Page 58: Goble keynote vivo-scits2014

Checklists, aka Minimum Information Models

Safety, quality, consistency

Validation, monitoring

Common in experimental science

Checklists defined in terms of the RO model and its annotations

Services execute against model and an RO’s annotations

Zhao et al., A Checklist-Based Approach for Quality Assessment of Scientific Information, 3rd Intl. Workshop on Linked Science, 2013

Minim Checklist Ontology to describe checklists

Must, Should… Cardinalities… Rules…

http://purl.org/net/mim/ns
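
As a rough illustration of a service executing a checklist against an RO's annotations, the sketch below evaluates "Must" and "Should" requirements with a minimum cardinality. It does not use the Minim ontology or any Wf4Ever service; `Requirement`, `evaluate`, and the annotation names are hypothetical stand-ins chosen for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Requirement:
    level: str          # "MUST" or "SHOULD"
    name: str           # annotation the RO is expected to carry
    min_count: int = 1  # cardinality: at least this many values


def evaluate(checklist: List[Requirement], annotations: Dict[str, list]) -> dict:
    """Check an RO's annotations (name -> list of values) against a checklist."""
    report = {"satisfied": True, "failures": [], "warnings": []}
    for req in checklist:
        found = len(annotations.get(req.name, []))
        if found >= req.min_count:
            continue
        message = f"{req.level} {req.name}: needs {req.min_count}, found {found}"
        if req.level == "MUST":
            report["failures"].append(message)
            report["satisfied"] = False   # only unmet MUSTs fail the RO
        else:
            report["warnings"].append(message)
    return report


# A made-up checklist for a workflow-centred research object.
runnable_workflow = [
    Requirement("MUST", "workflow_definition"),
    Requirement("MUST", "input_data", min_count=2),
    Requirement("SHOULD", "example_output"),
]

report = evaluate(
    runnable_workflow,
    {"workflow_definition": ["wf.t2flow"], "input_data": ["a.csv", "b.csv"]},
)
```

Here the RO passes, with one warning about the missing example output; removing an input file would turn the result into a failure.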

Page 59: Goble keynote vivo-scits2014

Towards Smart Integrated Applications & Mediation

1. Id & Cite fluid things
2. First class citizenship & uniform handling of artifacts
3. Compound
4. Mixed, leaky Containers
5. Span outcomes, evolve outputs, emergence
6. Layered interpretation and management profiles using standards
7. Machine-processable
8. Technology Independent

Bechhofer, Why linked data is not enough for scientists, DOI: 10.1016/j.future.2011.08.004

Page 60: Goble keynote vivo-scits2014

Towards Smart Integrated Applications & Mediation

Bechhofer, Why linked data is not enough for scientists, DOI: 10.1016/j.future.2011.08.004

1. Id & Cite fluid things
2. First class citizenship & uniform handling of artifacts
3. Compound
4. Mixed, leaky Containers
5. Span outcomes, evolve outputs, emergence
6. Layered interpretation and management profiles using standards
7. Machine-processable
8. Technology Independent

Page 61: Goble keynote vivo-scits2014

Research Objects Framework: a systematic approach to representing a different unit of scholarship

“development” view

“logical” view

“process” view

“physical” view

SERVICES

POLICIES

LIFECYCLES

METADATA PROFILES

Page 62: Goble keynote vivo-scits2014

Let’s Bake Research Objects!

Page 63: Goble keynote vivo-scits2014

Open Archival Information System Pilot

ROs are “Information Packages”

ROManager, RODL

Page 64: Goble keynote vivo-scits2014

• A single, transferable object encapsulates description and resources
– Download, transfer, publish

• ZIP-based format + manifest describes aggregation and annotations
– Unpack with standard tooling

• JSON-LD for manifest
– Lightweight linked-data format
– Use JSON tooling and services
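
The bundle idea above (a ZIP archive plus a JSON manifest listing what is aggregated and how it is annotated) can be tried with nothing but the Python standard library. This is a sketch under stated assumptions: the manifest path, the key names, and the example files are chosen for illustration and are not taken from the slides, and the JSON-LD `@context` a real manifest would carry is left out.

```python
import io
import json
import zipfile

MANIFEST_PATH = ".ro/manifest.json"  # assumed location, for illustration


def bake_bundle(resources: dict, annotations: list) -> bytes:
    """Pack resources (path -> text) and a manifest into one ZIP archive."""
    manifest = {
        "id": "/",
        "aggregates": [{"uri": "/" + path} for path in sorted(resources)],
        "annotations": annotations,
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        for path, content in resources.items():
            bundle.writestr(path, content)
        bundle.writestr(MANIFEST_PATH, json.dumps(manifest, indent=2))
    return buffer.getvalue()


def read_manifest(data: bytes) -> dict:
    """Unpack with standard tooling: any ZIP reader plus any JSON parser."""
    with zipfile.ZipFile(io.BytesIO(data)) as bundle:
        return json.loads(bundle.read(MANIFEST_PATH))


# Made-up contents, purely to exercise the functions.
example = bake_bundle(
    {
        "data/input.csv": "id,value\n1,42\n",
        "scripts/analyse.py": "print('hello')\n",
    },
    [{"about": "/data/input.csv", "content": "Raw measurements (made-up example)"}],
)
manifest = read_manifest(example)
```

The point of the design is that nothing special is needed on the receiving side: the archive opens with any ZIP tool, and the manifest parses with any JSON library.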

Baking with off the shelf platforms

OMEX archive

bundle

Adobe UCF, ORE, PROV, ODF

Page 65: Goble keynote vivo-scits2014

• Work with local folder structure.
– Version: github.
– Metadata: Local tooling
– Metadata about aggregation and its resources: “hidden folder”

• Zenodo/figshare pull snapshot from github
– DOIs for aggregation
– new DOIs: release cycles

Baking with off the shelf platforms

http://dx.doi.org/10.6084/m9.figshare.1031591

Page 66: Goble keynote vivo-scits2014

FARSITE

coded descriptions of clinical study cohorts

an NHS tool to assess the feasibility of gathering a cohort

packages codes, study, and metadata

Home Baking

Page 67: Goble keynote vivo-scits2014

In the Wild

Safari

Page 68: Goble keynote vivo-scits2014

integrated database and journal

http://www.gigasciencejournal.com

galaxy.cbiit.cuhk.edu.hk

[Peter Li]

Page 69: Goble keynote vivo-scits2014

Nanopub: represents structured data along with its provenance in a single publishable and citable entry

Galaxy workflows: re-enact the analysis

Research Object: aggregates the (digital) resources contributing to findings of (computational) research (results, data and software) as citable compound digital objects

http://isa-tools.github.io/soapdenovo2/

http://sandbox.wf4ever-project.org/portal/ro?ro=http://sandbox.wf4ever-project.org/rodl/ROs/SOAP2denovo2-Aureus/

[Alejandra Gonzalez-Beltran, Philippe Rocca-Serra]

Page 70: Goble keynote vivo-scits2014

what’s the least we can do?

how might ROs be minted and used by science teams?

how might ROs be implemented and used by developer teams?

Standards

Models

Platforms

Id Schemes

Resolution

Light touch

Extensible

Infiltration

Mapping

Making, Curating, Using

Nudging

Sharing

Linking

Infiltration

Embedding into and changing work practices

TOOLS

Citing

Technical Social

Reward

Mixed stewardship

Citation Schemes

Fragility

Page 71: Goble keynote vivo-scits2014

[Norman Morrison]

Page 72: Goble keynote vivo-scits2014

(meta)Data Capture Platforms

Process Capture Platforms

Page 73: Goble keynote vivo-scits2014

Stealthy not Sneaky

to reduce the friction

instrument the world

Incremental

JIT not JIC

Focus on Personal Productivity not Public Good

Auto-magical

From made reproducible to born reproducible

What’s the least we can do?

Page 74: Goble keynote vivo-scits2014

Knowledge Turns
Transportation & Mediation
Unit of Scholarly Currency
Context, Comparison
Distributed: Search, Discover, Index, Harvest, Port

Research Turns
Release model: Evolution, Emergence, Discourse, Comparison, Historical review
Forks, Merges & Fixivity
Flow across groups, projects and articles
Anti-Salami, Threaded Publications

Schopf, Treating Data Like Software: A Case for Production Quality Data, JCDL 2012

Goble, De Roure, Bechhofer, Accelerating Knowledge Turns, I3CK, 2013

Profile Focus
Body of knowledge around methods, workflows, software, data, person, rather than publication.
First class citation, credit and respect

Page 75: Goble keynote vivo-scits2014

Open Research Practice is (increasingly) like Open Source Software Practice.

(Which we know a lot about)

Page 76: Goble keynote vivo-scits2014

FAIR research practice benefits from a shared and principled approach for identification, aggregation and annotation of research components of all kinds.

– Using existing standards, vocabularies, frameworks, platforms, infrastructures. Using linked data and semantic interoperability

VIVO - to represent the full context of researchers’ work.

SciTS – to study the research process and research collaboration

Page 77: Goble keynote vivo-scits2014

http://www.researchobject.org

Page 78: Goble keynote vivo-scits2014

• Barend Mons
• Sean Bechhofer
• Philip Bourne
• Matthew Gamble
• Raul Palma
• Jun Zhao
• Alan Williams
• Stian Soiland-Reyes
• Paul Groth
• Tim Clark
• Juliana Freire
• Alejandra Gonzalez-Beltran
• Philippe Rocca-Serra
• Ian Cottam

All the members of the Wf4Ever team
iSOCO: Intelligent Software Components S.A., Spain
University of Manchester, School of Computer Science, Manchester, United Kingdom
University of Oxford, Department of Zoology, Oxford, UK
Poznan Supercomputing and Networking Center, Poznan, Poland
IAA: Instituto de Astrofísica de Andalucía, Granada, Spain
Leiden University Medical Centre, Centre for Human and Clinical Genetics, The Netherlands

Colleagues in Manchester’s Information Management Group
RO Advisory Board Members

http://www.researchobject.org
http://www.wf4ever-project.org