Post on 17-Feb-2017
Reproducibility, Research Objects and Reality
Professor Carole Goble
The University of Manchester, UK
Software Sustainability Institute, UK
ELIXIR UK, FAIRDOM Association e.V.
carole.goble@manchester.ac.uk
University of Leiden, The Netherlands, 24 November 2016
Acknowledgements
• Dagstuhl Seminar 16041, January 2016
– http://www.dagstuhl.de/en/program/calendar/semhp/?semnr=16041
• ATI Symposium Reproducibility, Sustainability and Preservation, April 2016
– https://turing.ac.uk/events/reproducibility-sustainability-and-preservation/
– https://osf.io/bcef5/files/
• C Titus Brown
• Juliana Freire
• David De Roure
• Stian Soiland-Reyes
• Barend Mons
• Tim Clark
• Daniel Garijo
• Norman Morrison
• Katy Wolstencroft
• Phil Bourne
• Natalie Stanford
• Jacky Snoep
• Stuart Owen
• Marco Roos
• Kristina Hettne
• Alan Williams
• Sean Bechhofer
• Ian Fore
• Rafael Jimenez
• Michael Crusoe
• Paul Groth
• Niall Beard
… and many more
Context: Computational Science
http://tpeterka.github.io/maui-project/
From: The Future of Scientific Workflows, Report of DOE Workshop 2015, http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pd
1. Observational, experimental
2. Theoretical
3. Simulation
4. Data intensive
Motivation: Knowledge Turning
research infrastructures
• Computational tools
• Sharing platforms
• Knowledge Exchange
• Reproducible research
• Software and data practices
• Policies
[Josh Sommer, for the picture]
Reproducibility Rampancy
NIH Rigor and Reproducibility
https://www.nih.gov/research-training/rigor-reproducibility
Plenty of guidelines
cos.io/top
Plenty of principles
https://wellcomeopenresearch.org/ Nature Scientific Data
Data as a first class citizen + Data Citation
Scholarly Communications Providers
Software as a first class citizen + Software Citation
Funders
http://www.acmedsci.ac.uk/policy/policy-projects/reproducibility-and-reliability-of-biomedical-research/
republic of science*
regulation of science
institution cores / libraries / public services
*Merton’s four norms of scientific behaviour (1942)
FAIR
Findable
Accessible
Interoperable
Reusable
Intelligible
Reproducible
Citable
Track & Countable
http://ec.europa.eu/research/participants/data/ref/h2020/grants_manual/hi/oa_pilot/h2020-hi-oa-data-mgt_en.pdf
Research Infrastructure for FAIR Management and Sharing of Data, Operating Procedures and Models for Systems and Synthetic Biology Projects
Research Infrastructure for FAIR Data for Life Sciences in Europe
Data-Driven Science
design: cherry picking data, random seed reporting, non-independent bias, poor positive and negative controls, dodgy normalisation, arbitrary cut-offs, premature data triage, un-validated materials, improper statistical analysis, poor statistical power, stop when “get to the right answer”, software misconfigurations, misapplied black box software
reporting: incomplete reporting of software configurations, parameters & resource versions, missed steps, missing data, vague methods, missing software
Empirical, Statistical, Computational
V. Stodden, IMS Bulletin (2013)
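Two of the reporting failures listed above (random seeds, and software configuration and versions) are cheap to avoid. A minimal sketch, assuming a toy stochastic analysis; the function name `run_with_recorded_seed` and the record fields are hypothetical and not taken from the talk:

```python
import json
import platform
import random
import sys


def run_with_recorded_seed(seed: int, n: int = 5) -> dict:
    """Run a toy stochastic 'analysis' and return the result with its run record."""
    # A local generator means the seed alone determines the draws.
    rng = random.Random(seed)
    draws = [rng.random() for _ in range(n)]
    return {
        "seed": seed,                                  # report the seed
        "n": n,                                        # report the parameters
        "python_version": platform.python_version(),  # report the software version
        "platform": sys.platform,                      # report the environment
        "result_mean": sum(draws) / n,
    }


if __name__ == "__main__":
    # The record can be stored next to the figures it produced.
    print(json.dumps(run_with_recorded_seed(seed=2016), indent=2))
```

Rerunning with the same seed on the same interpreter gives the same record, which is the "re-run" end of the spectrum rather than independent reproduction.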
Reproducibility and reliability of biomedical research: improving research practice
“When I use a word,” Humpty Dumpty said in rather a scornful tone, “it means just what I choose it to mean - neither more nor less.”
Carroll, Through the Looking Glass
re-compute
replicate
rerun
repeat
re-examine
repurpose
recreate
reuse
restore
reconstruct
review
regenerate
revise
recycle
redo
robustness tolerance
verification compliance validation assurance
remix
Scientific publication goals: (i) announce a result; (ii) convince readers it’s correct.
Papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension.
Papers in computational science should describe the results and provide the complete software development environment, data and set of instructions which generated the figures.
Virtual Witnessing*
*Leviathan and the Air-Pump: Hobbes, Boyle, and the Experimental Life (1985) Shapin and Schaffer.
Jill Mesirov
David Donoho
Computational Complex Assemblies
Remote Calls
“Micro” Reproducibility
“Macro” Reproducibility
Fixivity
Validate
Verify
Trust
Repeatability: “Sameness”
Same result
1 Lab
1 experiment
Reproducibility: “Similarity”
Similar result
> 1 Lab
> 1 experiment
why the differences?
https://2016-oslo-repeatability.readthedocs.org/en/latest/repeatability-discussion.html
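The distinction between “sameness” and “similarity” can be made concrete as two different comparisons. A minimal sketch, assuming results are single numbers and that the community has agreed a relative tolerance; both function names and the 5% default are illustrative assumptions:

```python
import math


def is_repeat(original: float, rerun: float) -> bool:
    """Repeatability in the 'sameness' sense: the rerun gives the identical result."""
    return original == rerun


def is_reproduction(original: float, independent: float, rel_tol: float = 0.05) -> bool:
    """Reproducibility in the 'similarity' sense: results agree within a tolerance."""
    return math.isclose(original, independent, rel_tol=rel_tol)
```

A result that fails `is_repeat` but passes `is_reproduction` is where the slide's question applies: why the differences?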
Method Reproducibility: the provision of enough detail about study procedures and data so the same procedures could, in theory or in actuality, be exactly repeated.
Result Reproducibility (aka replicability): obtaining the same results from the conduct of an independent study whose procedures are as closely matched to the original experiment as possible.
Goodman, et al Science Translational Medicine 8 (341) 2016
Productivity
Track differences
reviewers want additional work
statistician wants more runs
analysis needs to be repeated
post-doc leaves, student arrives
new/revised datasets
updated/new versions of algorithms/codes
sample was contaminated
better kit - longer simulations
new partners, new projects
Personal & Lab Productivity
Public Good Reproducibility
Computational “Datascopes”
Experiment
Methods: techniques, algorithms, spec. of the steps, models
Materials: datasets, parameters, algorithm seeds
Setup
Instruments: codes, services, scripts, underlying libraries, workflows, ref datasets
Laboratory: sw and hw infrastructure, systems software, integrative platforms, computational environment
“Datascope” Practicalities
Experiment: Methods, Materials
Setup: Instruments, Laboratory
Change
Dependencies
science, methods, datasets
questions stay, answers change
breakage, labs decay, services, techniques and instruments change, updated datasets, services, codes, hardware
software entropy
one offs, streams, stochastics, sensitivities, scale, non-portable data
supercomputer access
non-portable software
licensing restrictions
unreliable resources and third party codes
complexity
Blackboxes
blackbox software
hidden manual steps
Active Instrument
Byte level preservation
Reproduce by Running
Reproduce by Reading
Archived Record
Prepare to repair
ELNs
Markup Languages
Reporting Guidelines
Common Formats
Community vocabularies
Record All
Automate All
Contain All
Expose All
Findable
Accessible
Interoperable
Reusable
provenance
portability preservation
robustness
versioning
access description
standards
common APIs
licensing
standards, common metadata
change variation sensitivity
discrepancy handling
packaging, containers
FAIR RACE shades of reproducibility
dependencies
steps
ids
A robust infrastructure for biological information.
bio.tools
https://usegalaxy.org/
Workflow Description
Workflow Preservation
Workflow Portability
Workflow Interoperability
Workflow Preservation and Exchange
Experiments
Workflows & Workflow Runs
Workflow Commons
Third Party Services
Scattered resources
Rich descriptions
Prepare to Repair
Standards-based metadata framework for bundling resources with context
Citable Reproducible Packaging
Metadata for bundling resources scattered and stored somewhere else
Container
Research Object in a nutshell
Packaging content & links: Zip files, BagIt, Docker images
Catalogues & Commons Platforms: FAIRDOM, myExperiment
Manifest Construction
Aggregates: link things together
Annotations: about things & their relationships
Manifest Description
Dependencies: what else is needed
Versioning: its evolution
Checklists: what should be there
Provenance: where it came from
Identification (id): locate things regardless where
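The manifest construction described above (aggregates that link things together, annotations about things and their relationships) can be sketched as a small JSON document. This is a simplified illustration: the key names are loosely modelled on the Research Object Bundle manifest, and every path, URL and the `build_manifest` helper are hypothetical:

```python
import json


def build_manifest(resources: list[dict], annotations: list[dict]) -> dict:
    """Assemble a minimal manifest: what is aggregated, and what is said about it."""
    return {
        "id": "/",
        # Aggregates: each entry points at a resource, packaged or remote.
        "aggregates": [
            {"uri": r["uri"], "mediatype": r.get("mediatype", "application/octet-stream")}
            for r in resources
        ],
        # Annotations: 'content' holds metadata describing the 'about' resource.
        "annotations": [
            {"about": a["about"], "content": a["content"]} for a in annotations
        ],
    }


manifest = build_manifest(
    resources=[
        {"uri": "/workflow/analysis.cwl", "mediatype": "text/x-yaml"},  # packaged file
        {"uri": "http://example.org/data/input.csv"},  # stored elsewhere, linked not copied
    ],
    annotations=[
        {"about": "/workflow/analysis.cwl", "content": "/annotations/provenance.ttl"},
    ],
)
print(json.dumps(manifest, indent=2))
```

Because aggregated entries are URIs, the same manifest can describe content inside the package and resources scattered and stored somewhere else.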
Container
Research Object Profile for Workflows…
Manifest Description
Identification: locate things regardless where
Minimum information for one content type
Common properties among content types
Belhajjame et al (2015) Using a suite of ontologies for preserving workflow-centric research objects, J Web Semantics doi:10.1016/j.websem.2015.01.003
Hettne KM, et al (2014), Structuring research methods and data with the research object model: genomics workflows as a case study. J. Biomedical Semantics 5: 41
Workflow Research Object Bundles exchange, portability and maintenance
BagIt
workflows packaged into various containers for sharing
Checksum
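Checksums are what let a packaged workflow be verified after exchange. A minimal sketch in the style of a BagIt payload manifest (one `checksum  path` line per file); it is not a full BagIt implementation, and the function names and example file are assumptions made for illustration:

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_checksum_manifest(payload_dir: Path, manifest_path: Path) -> list[str]:
    """Write one 'checksum  relative/path' line per payload file."""
    lines = []
    for path in sorted(p for p in payload_dir.rglob("*") if p.is_file()):
        # Paths are recorded relative to the bag root, e.g. data/workflow.cwl
        relative = path.relative_to(payload_dir.parent).as_posix()
        lines.append(f"{sha256_of(path)}  {relative}")
    manifest_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines


def verify_checksum_manifest(bag_dir: Path, manifest_path: Path) -> bool:
    """Recompute every checksum and compare it with the recorded value."""
    for line in manifest_path.read_text(encoding="utf-8").splitlines():
        expected, relative = line.split(maxsplit=1)
        if sha256_of(bag_dir / relative) != expected:
            return False
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        bag = Path(tmp)
        (bag / "data").mkdir()
        (bag / "data" / "workflow.cwl").write_text("cwlVersion: v1.0\n", encoding="utf-8")
        manifest_file = bag / "manifest-sha256.txt"
        write_checksum_manifest(bag / "data", manifest_file)
        print(verify_checksum_manifest(bag, manifest_file))
```

Verification detects byte-level changes only; it says nothing about whether the workflow still runs against services that have moved on.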
Workflow and Workflow Management System Zoo
https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
bio.tools
A community led standard way of expressing and running workflows and command line tools using containers
Ontologies for describing tools and their inputs and outputs
Metadata framework for the manifest versioning, file integrity, more metadata about the workflow
Workflow fragment containers
Findable
Accessible
Interoperable
Reusable
Data
Operations
Models
Systems and Synthetic Biology Projects
Funder: Legacy!
Partners
Project Support
Community Actions
Platforms, Tools
Web-based Portal Public Commons
50+ projects
5 programmes
400+ people
22 independent installations
Systems Approach…Multiple, interrelated assets, Multiple, dispersed repositories
Literature
SOPs
STANDARDS
versioning, tracking: provenance, parameters, citation
Operations
Data
Models
FAIR Data and Metadata Standards that help to improve understanding and exchange….
Nicolas Le Novère, Babraham Institute, UK.
…researchers do not always use them....
… model reuse and reproducibility tricky…
Stanford et al The evolution of standards and data management practices in systems biology, Molecular Systems Biology (2015) 11: 851 DOI 10.15252/msb.20156053
Systems Approach…teams, processes, multi-partner, multi-discipline, legacy
P1. BaCell-SysMO: The transition from growing to non-growing Bacillus subtilis cells - A systems biology approach
P2. COSMIC: Systems Biology of Clostridium acetobutylicum - a possible answer to dwindling crude oil reserves
P3. SUMO: Systems Understanding of Microbial Oxygen Responses, Escherichia coli
P4. KOSMOBAC: Ion and solute homeostasis in enteric bacteria, Escherichia coli
P5. SysMO-LAB: Comparative Systems Biology: Lactic Acid Bacteria: Lactococcus lactis, Enterococcus faecalis, Streptococcus pyogenes
P6. PSYSMO: Systems analysis of biotech induced stresses: towards a quantum increase in process performance in the cell factory Pseudomonas putida
P7. SCaRAB: Systems Biology of a genetically engineered Pseudomonas fluorescens with inducible exo-polysaccharide production: analysis of the dynamics and robustness of metabolic networks
P8. MOSES: MicroOrganism Systems Biology: Energy and Saccharomyces cerevisiae
P9. TRANSLUCENT: Gene interaction networks and models of cation homeostasis in Saccharomyces cerevisiae
P10. STREAM: Global metabolic switching in Streptomyces coelicolor
P11. SulfoSYS: Silicon cell model for the central carbohydrate metabolism of the archaeon Sulfolobus solfataricus under temperature variation
P12. SysMO-DB: Data management group
Funders
Researchers
Publishers
Who is working with which organism?
What methods are being used to determine enzyme activity?
Under which experimental conditions are my partners working for the measurement of glucose concentration?
What is the provenance of the parameters for this version of the model?
What SOP was used for this sample?
Where is the validation data for this model?
Is there any group generating kinetic data?
Is this data available?
Track versions of my model
What’s the relationship between the data and model?
Which data belong to which publications?
FAIR
A Commons
fairdomhub.org
Investigation
Study Analysis
Data
Model
SOP(Assay)
…organised in Investigation, Study, Assay/Analysis format
…registered using Just Enough Results description
Just Enough Results Model
Common elements
Uploaded into theFAIRDOM Store
Linked to entry in Public Archive
Linked to entry in Project store
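The Investigation, Study, Assay/Analysis organisation is a simple nesting that keeps data, models and SOPs attached to their context. A minimal sketch with Python dataclasses; the class and field names are assumptions for illustration and are not the FAIRDOM schema:

```python
from dataclasses import dataclass, field


@dataclass
class Asset:
    title: str
    kind: str      # "data", "model" or "sop"
    location: str  # uploaded to a store, or a link to an entry in an archive


@dataclass
class Assay:
    title: str
    assets: list[Asset] = field(default_factory=list)


@dataclass
class Study:
    title: str
    assays: list[Assay] = field(default_factory=list)


@dataclass
class Investigation:
    title: str
    studies: list[Study] = field(default_factory=list)

    def assets_of_kind(self, kind: str) -> list[Asset]:
        """Collect every asset of one kind across all studies and assays."""
        return [
            asset
            for study in self.studies
            for assay in study.assays
            for asset in assay.assets
            if asset.kind == kind
        ]


# Hypothetical example content.
investigation = Investigation(
    title="My Investigation",
    studies=[
        Study(
            title="Glucose uptake",
            assays=[
                Assay(
                    title="Enzyme kinetics",
                    assets=[
                        Asset("Kinetic measurements", "data", "project-store://kinetics.xlsx"),
                        Asset("Uptake model", "model", "http://example.org/models/uptake.xml"),
                        Asset("Sampling protocol", "sop", "project-store://sop-01.pdf"),
                    ],
                )
            ],
        )
    ],
)
print([a.title for a in investigation.assets_of_kind("model")])
```

With this nesting, questions such as "which SOP was used for this sample?" become a walk from an asset back up to its assay and study.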
... aggregating catalogue
metadata across repositories, retain context -> reproduce, reuse
Local Stores
External Databases
Publishing services
Secure Stores
Model Resources
… in situ reproducible models
metadata annotation against standards
model validation, comparison and simulation
SBML Model simulation
Model comparison
Model versioning
Reproducing simulations
[Jacky Snoep, Dagmar Waltemath, Martin Peters, Martin Scharm]
…. Nested Packages
context and credit
Research Objects
• Link
• Nest
• Span
• Bundle
• Snapshot
Systematic, Standards-based metadata framework for logically and physically bundling resources with context
• Exchange
• Reproduce
• Release packages
Reproducible Exchange and Publishing
and better credit
reviewer
Author List: Joe Bloggs; Jane Doe
Title: My Investigation
Date: September 2016
DOI: https://doi.org/10.15490/seek##
information travels with the data and models
How do we do? Pretty well.
Reproducibility window. But that’s ok!
• Can’t contain everything
– Pesky Internet in a Box
• Can’t automate everything
– Pesky people
• Can’t fix everything
– Pesky science
Asthma Research e-Laboratory
Release builds of pharmacological knowledge warehouse
Exchanging large datasets
Samiul Hasan, GSK
Biocuration need in Pharma: Drivers from a Translational Bioinformatics Perspective, Poster S16, 1st EASYM Conference, Berlin 2016
Reality
Preparation pain. Goldilocks paradox.
[Norman Morrison]
replication hostility
no funding, time, recognition, place to publish
resource intensive
access to the complete environment
“Data Parasites”
“Data Flirters”
“Share Drift”
Family
Friends
Potential Friends
Acquaintances
Strangers
Rivals
Reciprocity
Using FAIRDOM my own lab colleagues saw what I was doing and called to collaborate!
Jurgen Haanstra
Vrije Universiteit Amsterdam, Netherlands
Trust …
Half of researchers make research data available so they can be used by others.
Most have not experienced any direct benefits, nor many bad effects.
Caveat: shared but usable?
fake sharing
funder requirements
fear data will be misused or misinterpreted
journal requirements
good research practice
facilitate collaborations
enable validation and replication
higher citation rates
time and effort
new collaborations
extra funding for cost of data prep
enhance their academic reputation
feedback on how other researchers were using their data
taken into account in funding
taken into account in career
jeopardise future publications
it’s not ready to share
scrutiny scruples
answering questions
I won’t get credited
Metadata in by side effect
Tooling for annotations and checklist templates for different types of assay data.
Embed ontologies into Excel templates
Excel spreadsheets enriched with ontology annotations
Upload, extract metadata and register
http://www.rightfield.org.uk
Spreadsheet Ramps!!
Sharing by side effect …. libertarian paternalism
[Kristian Garza]
Finding and Citing by side effect
• Schema.org
• Structured markup in web pages
• Supported by Content Management Systems
• Harvested by search engines
• Builds snippets and sidebars
Bioschemas.org
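Structured markup in a web page is typically a small JSON-LD block that a content management system emits and a search engine harvests. A minimal sketch using plain schema.org `Dataset` properties rather than any specific Bioschemas profile; the helper name and all example values are hypothetical:

```python
import json


def dataset_jsonld(name: str, description: str, url: str, identifier: str) -> str:
    """Render schema.org Dataset markup as a script tag for embedding in a page."""
    markup = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "name": name,
        "description": description,
        "url": url,
        "identifier": identifier,
    }
    return (
        '<script type="application/ld+json">\n'
        + json.dumps(markup, indent=2)
        + "\n</script>"
    )


snippet = dataset_jsonld(
    name="My Investigation",
    description="Data and models from a hypothetical systems biology study.",
    url="http://example.org/investigations/1",
    identifier="http://example.org/id/investigation-1",
)
print(snippet)
```

The page author never fills in a separate registry form: finding and citing happen as a side effect of publishing the page.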
Data repositories and training resources marked up with Bioschemas
Search engine
Bio Registries
BioSharing, OLS, TeSS, bio.tools
UKCRC Tissue Directory
bioCADDIE DataMed
PDBe, UniProt, InterPro, Molgenis, Pfam
Gene3D, BioSamples, Biobank websites, BRENDA, HPA
TransPlant, EGA Beacons
EBI-Search, Google
Big co-operative data-driven science makes reproducibility desirable but also means dependency and change are to be expected
Words matter.
50 Shades of Reproducibility.
form vs function
Reproducibility is not an end.
Beware zealots.
Amplify Side effects
Think Research Objects!