Reproducible scientific workflows - Acting on Change 2016

Reproducible scientific workflows Tomasz Miksa Vienna University of Technology & SBA Research, Austria

Page 1

Reproducible scientific workflows

Tomasz Miksa
Vienna University of Technology & SBA Research, Austria

Page 2

eScience and Research Infrastructures

Scientists exchange
- facilities
- resources
- services
- datasets

Research requires
- special tooling and software
- workflows to capture, transform, visualize, and interpret the data

Page 3

Workflows and Context

‘Workflows’ can be
- ad hoc commands and scripts executed manually
- well-structured processes executed within a controlled environment

Workflows
- share infrastructure with other processes
- delegate tasks to tools installed in the system
- require specific configurations
- can use distributed systems

[Figure: Taverna Workflow]

Script

#!/bin/bash

# fetch data
java -jar GestBarragensWSClientIQData.jar
unzip -o IQData.zip

# fix encoding
#iconv -f LATIN1 -t UTF-8 iq.r > iq_utf8.r

# generate references
R --vanilla < iq_utf8.r > IQout.txt

# create pdf
pdflatex iq.tex
pdflatex iq.tex
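One way to check that a rerun of a script like this produced the same results is to record checksums of its inputs and outputs and compare them later. The sketch below is an illustration and not part of the original slides: the function names write_manifest and verify_manifest and the file name manifest.sha256 are hypothetical, and it assumes the GNU coreutils sha256sum tool is available. PDF files produced by pdflatex usually embed a creation date, so text outputs such as IQout.txt are probably better candidates for comparison.

```shell
#!/bin/bash
# Sketch only: function and file names are illustrative assumptions.

# Print one "checksum  filename" line per existing file.
# Missing files are reported on stderr and make the function return 1.
write_manifest() {
    local status=0 f
    for f in "$@"; do
        if [ -f "$f" ]; then
            sha256sum "$f"
        else
            echo "warning: $f not found, skipped" >&2
            status=1
        fi
    done
    return "$status"
}

# Re-compute the checksums listed in a manifest and compare them.
# Returns non-zero if any file is missing or has changed.
verify_manifest() {
    sha256sum -c "$1"
}

# Possible usage after the workflow script has finished:
#   write_manifest IQData.zip IQout.txt > manifest.sha256
# and after a later rerun:
#   verify_manifest manifest.sha256
```

A manifest like this only detects that something changed; explaining why still requires documentation of the workflow and its environment.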

Page 4

Reproducibility

Current studies show very low reproducibility in
- medicine
- economics
- computer science

Reproducibility requires
- well-documented research workflows
- precise information on the experiment's environment
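As a rough illustration of what recording information on the experiment's environment could look like, the sketch below writes the date, kernel, machine type, and the versions of a few tools to a text file stored next to the results. It is not taken from the slides: the function names tool_version and capture_environment and the file name environment.txt are hypothetical, and the assumption that each tool answers to --version does not hold for every program (older Java releases, for example, expect -version).

```shell
#!/bin/bash
# Sketch only: names and the choice of recorded fields are assumptions.

# Print the first line of "<tool> <flag>", or a marker if the tool is missing.
# The flag defaults to --version, which not every tool understands.
tool_version() {
    local tool="$1" flag="${2:---version}"
    if command -v "$tool" > /dev/null 2>&1; then
        "$tool" "$flag" 2>&1 < /dev/null | head -n 1
    else
        echo "not installed"
    fi
}

# Print basic facts about the machine, then one line per requested tool.
capture_environment() {
    local tool
    echo "date: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
    echo "kernel: $(uname -s) $(uname -r)"
    echo "machine: $(uname -m)"
    echo "bash: ${BASH_VERSION:-unknown}"
    for tool in "$@"; do
        echo "${tool}: $(tool_version "$tool")"
    done
}

# Record the environment for the tools a workflow depends on.
capture_environment R java pdflatex > environment.txt
```

This captures only a small part of the environment. Library versions, configuration files, and remote services would need separate treatment.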

Page 5

Reproducibility: Neuroanatomical studies

FreeSurfer software
- measures cortical thickness and volume of neuroanatomical structures

Results differ across
- FreeSurfer versions
  • v4.3.1, v4.5.0, v5.0.0
- workstations
  • Mac, Hewlett-Packard
- operating system versions
  • OSX 10.5, OSX 10.6

E. Gronenschild, P. Habets, H. I. L. Jacobs, R. Mengelers, N. Rozendaal, J. van Os, and M. Marcelis, “The effects of FreeSurfer version, workstation type, and Macintosh operating system version on anatomical volume and cortical thickness measurements,” 2012.
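Since results can depend on the software version, a workflow could refuse to run when a tool reports a different version than the one used originally. The following is a minimal sketch under that assumption: require_version is a hypothetical helper, and the expected string has to be whatever first line the tool in question actually prints.

```shell
#!/bin/bash
# Sketch only: require_version is an illustrative helper, not an existing tool.

# Usage: require_version EXPECTED COMMAND [ARGS...]
# Runs COMMAND and compares the first line of its output with EXPECTED.
# Returns 0 on a match, 1 otherwise (with a message on stderr).
require_version() {
    local expected="$1" actual
    shift
    actual="$("$@" 2>&1 < /dev/null | head -n 1)"
    if [ "$actual" != "$expected" ]; then
        echo "version mismatch: expected '$expected', got '$actual'" >&2
        return 1
    fi
}

# Possible usage at the top of a workflow script, where expected_R.txt is a
# hypothetical file holding the first line of "R --version" from the
# original run:
#   require_version "$(cat expected_R.txt)" R --version || exit 1
```

A check like this covers one tool at a time. The workstation type and operating system version named in the study would need their own checks, for example against the output of uname.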

Page 6

Reproducibility: Computer Science

613 papers in 8 ACM conferences

C. Collberg and T. Proebsting, “Measuring reproducibility in computer systems research,” 2014. [Online]. Available: http://reproducibility.cs.arizona.edu/tr.pdf

Page 7

Reproducibility: Computer Science

E-mail responses from authors
- Wrong version
- Code will be available soon
- Programmer left
- Bad backup practices
- Commercial code
- Proprietary academic code
- Intellectual property
- No intention to release
- …

Page 8

Variety of solutions

- Workflow systems
- Interactive notebooks
- Virtualisation
- Containers
- Code repositories
- Automated builds
- Service monitoring
- Metadata standards
- Provenance
- Preservation planning
- Repositories

Page 9

TIMBUS - Process preservation

- Digital preservation of business processes
- Based on risk management
- Context modelling is the key

Page 10

TIMBUS - Context modelling

- Context model
- Automated extractors
- Process execution monitoring
- Service monitoring

Page 11

TIMBUS - Risk mitigation strategies

Metadata and documentation

Migration
- File formats
- Storage media
- Alternative services
  • Open source service
  • In-housing of services

Emulation

Virtualisation

Mock-up of systems

Page 12

Summary

Scientific experiments
- workflows for data processing with software dependencies

Risks affecting reproducibility
- reproducibility is low due to insufficient experiment descriptions

Solutions for improving reproducibility
- improve data management, sharing and reuse

TIMBUS approach for process preservation
- based on risk management practices
- using context modelling to evaluate preservation alternatives