Ipaw14 presentation Quan, Tanu, Ian

Auditing and Maintaining Provenance in Software Packages Quan Pham 1 Tanu Malik 2 Ian Foster 1,2 Department of Computer Science 1 and Computation Institute 2 , The University of Chicago, Chicago, IL 60637, USA [email protected], [email protected] Presented by Boris Glavic Illinois Institute of Technology IPAW14 June, 10 th , 2014 Provenance in Software Packages June, 10 th , 2014 1 / 29


DISCLAIMER: These are not my slides, but these slides are for the IPAW paper by Quan Pham, Tanu Malik, and Ian Foster: Auditing and Maintaining Provenance in Software Packages

Transcript of Ipaw14 presentation Quan, Tanu, Ian

Auditing and Maintaining Provenance in Software Packages

Quan Pham (1), Tanu Malik (2), Ian Foster (1,2)

Department of Computer Science (1) and Computation Institute (2), The University of Chicago,

Chicago, IL 60637, USA

[email protected], [email protected]

Presented by Boris Glavic

Illinois Institute of Technology

IPAW14

June, 10th, 2014

Provenance in Software Packages June, 10th, 2014 1 / 29

Outline

1 Introduction

2 Software Pipeline Usecase

3 CDE-SP: Software Provenance in CDE

4 Experiment and Evaluation

5 Related Work

6 Conclusion


Current Solutions for Ensuring Reproducibility and Issues

1 Publish source code and data
  Examples: GitHub, Figshare, Research Compendia
  Pros: (in many cases) easy to accomplish
  Cons: need to recompile and re-execute

2 Publish software package including source code, data, and environment dependencies
  Examples: CDE, RunMyCode.org
  Pros: re-execute without installation
  Cons: not easy to combine and merge shared packages

3 Publish a virtual machine image (VMI) that includes OS, source code, data, and environment
  Examples: Cloud BioLinux (NEBC), Swift Appliance (RDCEP)
  Pros: no additional modules or components needed to rerun
  Cons: too hard to provision and understand


Reproducibility Problem

Our philosophy:

“... releasing shoddy VMs is easy to do, but it doesn’t help you learn how to do a better job of reproducibility along the way. Releasing software pipelines, however crappy, is on the path towards better reproducibility.”

C. Titus Brown [1]

Reproducibility problem: How can we make it easy to combine and merge shared packages, while correctly attributing authorship of software packages?

No need to provision VMIs or to publish only source code and data.

[1] http://ivory.idyll.org/blog/vms-considered-harmful.html

Problem Scope

Use CDE [2] to capture and create a portable software package

Extend, partially re-use, and combine CDE packages to create new reproducible software pipelines

Attribute authorship of software packages in new software pipelines

CDE has an OVERLAP conflict!

[2] Guo, P.J., Engler, D.: CDE: using system call interposition to automatically create portable software packages. USENIX Association, Portland, OR (2011)


CDE

Create a portable software package without installation, configuration, or privileged permissions

Audit mode to create a CDE package


CDE - Execution Mode


Software Pipelines Contain CDE packages

A software pipeline consists of many individual software modules

A software module depends on externally-developed libraries

A software module is often packaged together with specific versions of libraries


RDCEP Usecase

Alice, Bob, and Charlie are scientists at the Center for Robust Decision Making on Climate and Energy Policy (RDCEP)

  A develops data integration methods to produce higher-resolution datasets depicting inferred land use over time.

  B develops computational models to do model-based comparative analysis. B’s software environment consists of A’s software modules to produce high-resolution datasets.

  C uses A’s and B’s software modules within data-intensive computing methods to run them in parallel.

The Center wants to predict future yields of staple agricultural commodities given changes in the climate.

[Figure: nested packages and their pipeline stages.
  C's Package (merged from B's): Parallel init, Aggregation, Generate images, Model-based analysis, Parallel summary
  B's Package (from A's): Retrieve data, Aggregation, Generate images, Model-based analysis
  A's Package]


A’s Experiment & Package

A’s package

cde-root
  path to A’s files
    a-experiment.sh
    retrieve-data
    aggregation
    generate-image
    f1, f2, a-output
  path to common libs
    libc.so

Re-execute A’s experiment:

  cde-exec a-experiment.sh

cat a-experiment.sh

  ./retrieve-data f1
  ./aggregation f1 f2
  ./generate-image f2 a-output


B’s Experiment & Package

B’s package

cde-root
  path to A’s files
    [...]
  path to B’s files
    b-experiment.sh
    analysis
    b-output
  path to common libs
    libc.so

Re-execute B’s experiment:

  cde-exec b-experiment.sh

cat b-experiment.sh

  cd path to A’s experiment
  cde-exec a-experiment.sh
  cd path to B’s files
  ./analysis path to A’s files/a-output b-output


C’s Experiment & Package

C’s package

cde-root
  path to A’s files
    [...]
  path to B’s files
    [...]
  path to C’s files
    c-experiment.sh
    parallel-init
    parallel-summary
    c-output
  path to common libs
    libc.so

Re-execute C’s experiment:

  cde-exec c-experiment.sh

cat c-experiment.sh

  parallel-init path to A’s files/f4
  cd path to A’s files
  cde-exec ./aggregation f4 f5
  cde-exec ./generate-image f5 f6
  cd path to B’s files
  cde-exec ./analysis path to A’s files/f6 f7
  cd path to C’s files
  ./parallel-summary path to B’s files/f7 c-output


Dependency Overlap in Multiple cde-root Directories


File Overlap of Different Linux Distributions

        RH            SUSE          U12           U13
Amz     5498 / 23k    3184 / 11k    1203 / 5.4k   1819 / 5.5k
RH                    3861 / 12k    1654 / 6.6k   2223 / 6.3k
SUSE                                1245 / 3.9k   2085 / 6.4k
U12                                               8226 / 24k

Table 1 : Ratio of different files having the same path in 5 popular AMIs. The denominator is the number of files having the same path in two distributions, and the numerator is the number of files with the same path but a different md5 checksum. Omitted are manual pages in the /usr/share/ directory.

Amz   Amazon Linux AMI
RH    Red Hat Enterprise Linux 6.4
SUSE  SUSE Linux Enterprise Server 11
U12   Ubuntu Server 12.04.3 LTS
U13   Ubuntu Server 13.10
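The ratio reported in Table 1 can be computed with a short script. The sketch below is an illustration only, not the authors' tooling: the function names and the tiny demo trees are hypothetical, and it assumes both distributions are available as directory trees on the local disk.

```python
import hashlib
import os
import tempfile


def md5_by_relative_path(root):
    """Map each regular file under root to its MD5 digest, keyed by relative path."""
    digests = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.path.islink(full) or not os.path.isfile(full):
                continue  # this sketch skips symlinks and special files
            with open(full, "rb") as handle:
                digest = hashlib.md5(handle.read()).hexdigest()
            digests[os.path.relpath(full, root)] = digest
    return digests


def overlap_ratio(root_a, root_b):
    """Return (differing, shared): the numerator and denominator of Table 1."""
    a = md5_by_relative_path(root_a)
    b = md5_by_relative_path(root_b)
    shared = a.keys() & b.keys()
    differing = sum(1 for path in shared if a[path] != b[path])
    return differing, len(shared)


def demo():
    """Compare two tiny hypothetical trees that share two paths, one with different content."""
    with tempfile.TemporaryDirectory() as tmp:
        for distro, libc in (("distro_a", b"libc build 1"), ("distro_b", b"libc build 2")):
            lib_dir = os.path.join(tmp, distro, "lib")
            os.makedirs(lib_dir)
            with open(os.path.join(lib_dir, "libc.so"), "wb") as handle:
                handle.write(libc)
            with open(os.path.join(tmp, distro, "motd"), "wb") as handle:
                handle.write(b"same content")
        return overlap_ratio(os.path.join(tmp, "distro_a"), os.path.join(tmp, "distro_b"))
```

A file that appears in the numerator is exactly the kind of dependency that cannot be silently shared when two cde-root directories are merged.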


Re-direction in Multiple cde-root Directories


CDE-SP

CDE-SP: Enhanced CDE that includes software provenance

Describe tools and methods to audit, store, and query provenance

Provenance queries

  Determine the environment under which a dependency was built
  Examine the dependencies which must be present
  Answer if packages in a pipeline can satisfy a new package
  Attribute authorship of software packages in a pipeline

Combine and validate authorship from stored provenance


CDE-SP Audit

Objectives

Capture additional details of the origins of a library or a binary

Use these details for compiling and creating software pipelines

Methods

Create a dependency tree

  Process system calls are monitored
  Whenever a process executes a file system call, a dependency of that process is recorded
  Dependency can be a data file or a shared library

Extract information about binaries and required shared libraries

  file, ldd, strings, and objdump UNIX commands
  uname -a and function getpwuid(getuid())
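As a rough illustration of the extraction step, the sketch below parses the text printed by ldd and gathers machine and user details through Python's standard-library counterparts of uname -a and getpwuid(getuid()). The function names and the returned dictionary layout are assumptions made for this example; the slides do not show how CDE-SP implements this step.

```python
import os
import platform
import pwd


def parse_ldd_output(text):
    """Parse the text printed by `ldd` into {library name: resolved path or None}."""
    libraries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if "=>" in line:
            name, _, target = line.partition("=>")
            # Drop the trailing load address, e.g. "(0x00007f...)".
            path = target.split("(")[0].strip()
            libraries[name.strip()] = path if path and path != "not found" else None
        else:
            # Entries without "=>" are either absolute paths (the dynamic
            # loader) or virtual objects such as the vDSO.
            entry = line.split("(")[0].strip()
            libraries[entry] = entry if entry.startswith("/") else None
    return libraries


def audit_metadata():
    """Collect machine and user details, analogous to uname -a and getpwuid(getuid())."""
    uid = os.getuid()
    try:
        agent = pwd.getpwuid(uid).pw_name
    except KeyError:
        agent = str(uid)  # no passwd entry for this uid (common in containers)
    return {"machine": " ".join(platform.uname()), "agent": agent}
```

In practice the input to parse_ldd_output would come from running ldd on each binary recorded in the dependency tree, for example through subprocess.run.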


Storage

Store provenance within the package itself

Use LevelDB: a fast and light-weight key-value storage library

Encode in the key the UNIX process identifier along with spawn time

Key                          Value        Explanation
pid.PID1.exec.TIME           PID2         PID1 wasTriggeredBy PID2
pid.PID.[path, pwd, args]    VALUES       Other properties of PID
io.PID.action.IO.TIME        FILE(PATH)   PID wasGeneratedBy / wasUsedBy FILE(PATH)
meta.agent                   USERNAME     User information
meta.machine                 OSNAME       Operating system distribution

Table 2 : LevelDB key-value pairs that store file and process provenance. Capital letter words are arguments.
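To make the key layout of Table 2 concrete, here is a minimal sketch that builds such keys and stores them in a plain dictionary standing in for LevelDB. The helper names, the example PIDs, timestamps, and paths are all hypothetical, and the reading of IO as the kind of access ("read" or "write") is an assumption; only the key patterns themselves come from the table.

```python
def exec_key(pid, spawn_time):
    """Key whose value is the parent PID: pid.PID1.exec.TIME -> PID2."""
    return f"pid.{pid}.exec.{spawn_time}"


def property_key(pid, name):
    """Key for one of the process properties listed in Table 2."""
    if name not in ("path", "pwd", "args"):
        raise ValueError(f"unknown process property: {name}")
    return f"pid.{pid}.{name}"


def io_key(pid, io, time):
    """Key for a file access: io.PID.action.IO.TIME -> FILE(PATH)."""
    return f"io.{pid}.action.{io}.{time}"


def scan(store, prefix):
    """Return (key, value) pairs with the given prefix in sorted key order.

    LevelDB keeps keys sorted, so a prefix scan is a natural way to read
    back everything recorded about one process.
    """
    return sorted((k, v) for k, v in store.items() if k.startswith(prefix))


def build_example_store():
    """Record one hypothetical process (1002, spawned by 1001) and its file accesses."""
    store = {}
    store[exec_key("1002", "1402387200")] = "1001"
    store[property_key("1002", "path")] = "/home/alice/aggregation"
    store[io_key("1002", "read", "1402387201")] = "/home/alice/f1"
    store[io_key("1002", "write", "1402387202")] = "/home/alice/f2"
    store["meta.agent"] = "alice"
    store["meta.machine"] = "Ubuntu 12.04"
    return store
```

Calling scan(build_example_store(), "io.1002.") returns the read of f1 followed by the write of f2, which is the information needed to draw the edges around process 1002 in a provenance graph.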


Query

LevelDB provides a minimal API for querying

Simple, light-weight query interface

  Input: a program whose dependencies need to be retrieved
  Output: a GraphViz file displaying file and process dependencies

Use depth first search algorithm to create a dependency tree with the input program as its root

Exclusion option to remove uninteresting dependencies: /lib/, /usr/lib/, /usr/share/, /etc/
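The query described above can be sketched as a depth-first traversal that emits GraphViz DOT text. This is an illustrative reconstruction, not the CDE-SP source: the function name and the in-memory adjacency dictionary are assumptions, whereas a real implementation would read the edges from the LevelDB store.

```python
DEFAULT_EXCLUDES = ("/lib/", "/usr/lib/", "/usr/share/", "/etc/")


def dependency_tree_dot(dependencies, root, excludes=DEFAULT_EXCLUDES):
    """Return GraphViz DOT text for the dependencies reachable from root.

    `dependencies` maps a program or file to the list of things it depends
    on. Dependencies whose path starts with an excluded prefix are dropped.
    """
    lines = ["digraph provenance {"]
    visited = set()

    def visit(node):
        # Depth-first: follow each dependency fully before moving to the next.
        visited.add(node)
        for dep in dependencies.get(node, ()):
            if dep.startswith(tuple(excludes)):
                continue
            lines.append(f'  "{node}" -> "{dep}";')
            if dep not in visited:
                visit(dep)

    visit(root)
    lines.append("}")
    return "\n".join(lines)
```

The returned text can be written to a file and rendered with the GraphViz dot tool. The visited set keeps the traversal from looping when two programs depend on each other's outputs.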


Authorship of Software Modules

Combine authorship of the contributing packages

Validate authorship from the provenance stored in the original package

  Generate the subgraph associated with the part of the new package
  Use subgraph isomorphism (NP-Hard) to validate with the original provenance graph
  Match provenance nodes of processes with the same paths of their binaries and working directories
  Match provenance nodes of files with the same path
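The matching rules suggest a simplification worth illustrating. If every node is identified by its path (plus working directory for processes) and each identifier labels at most one node, the node mapping is fixed in advance and validation reduces to checking that every edge of the new subgraph also exists in the original graph. The sketch below shows that reduced check under exactly this assumption; it is not a general subgraph isomorphism solver, and the node and edge representations are hypothetical.

```python
def process_node(binary_path, working_dir):
    """Identify a process by the path of its binary and its working directory."""
    return ("process", binary_path, working_dir)


def file_node(path):
    """Identify a file by its path."""
    return ("file", path)


def missing_edges(new_edges, original_edges):
    """Return the edges of the new subgraph that the original graph lacks.

    An empty result means the reused part of the new package is consistent
    with the provenance stored in the original package.
    """
    return sorted(set(new_edges) - set(original_edges))


def is_valid_reuse(new_edges, original_edges):
    """True when every claimed edge is backed by the original provenance."""
    return not missing_edges(new_edges, original_edges)
```

When node identities are ambiguous, for example after paths have been relocated during a merge, the fixed mapping no longer holds and the general (NP-hard) matching problem returns.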


Experiments

Performance of CDE-SP

  Auditing performance overhead
  Disk storage increase
  Provenance query runtime

Redirection overhead when multiple UUID-based directories are created

Compare the lightweight virtualization approach of CDE-SP with Kameleon [3], a heavyweight virtualization approach used for reproducibility

Experiments were run on an Ubuntu 12.04 LTS workstation with 8 GB of RAM and an 8-core Intel(R) processor clocked at 1600 MHz.

[3] Emeras, J., Richard, O., Bzeznik, B.: Reconstructing the software environment of an experiment with Kameleon (2011)


Performance & Size Overhead

Pipeline with two applications: Aggregation and Generate Image

2.1% slowdown of CDE-SP vs. 0-30% CDE virtualization overhead [4]

LevelDB database size: 236kB (0.03% package size increase), containing approximately 12,000 key-value pairs

          Create Package   Execution    Disk Usage     Provenance Query
          (seconds)        (seconds)                   (seconds)
CDE       852.6±2.4        568.8±2.4    732MB
CDE-SP    870.5±2.5        569.5±1.8    732MB+236kB    0.4±0.03

Table 3 : The performance overhead of CDE-SP is negligible in comparison with CDE
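The percentages quoted on this slide can be re-derived from the Table 3 means. The short calculation below is a check of the arithmetic, not an additional measurement; it assumes 1 MB = 1024 kB for the storage figure, and the 2.1% slowdown appears to correspond to the package creation step.

```python
def percent_increase(baseline, measured):
    """Relative increase of measured over baseline, in percent."""
    return (measured - baseline) / baseline * 100.0


# Mean timings in seconds, taken from Table 3.
create_overhead = percent_increase(852.6, 870.5)   # package creation (audit)
execution_overhead = percent_increase(568.8, 569.5)  # re-execution

# LevelDB database size relative to the package size (assuming 1 MB = 1024 kB).
storage_overhead = 236 / (732 * 1024) * 100.0

print(f"create: {create_overhead:.1f}%")
print(f"execution: {execution_overhead:.2f}%")
print(f"storage: {storage_overhead:.2f}%")
```

The re-execution overhead is well below the package creation overhead, and the difference between the two execution means is smaller than their reported standard deviations.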

[4] Guo, P.J., Engler, D.: CDE: using system call interposition to automatically create portable software packages. USENIX Association, Portland, OR (2011)


Redirection Overhead in CDE-SP

Pipelined output of Aggregation to input of Generate Image

3 output files of the Aggregation package were moved to the Generate Image package

2 cross-package execve() system calls

Less than a 1% slowdown of CDE-SP


Kameleon

Use the Kameleon engine to make a bare-bones VM appliance

  Self-written YAML-formatted recipes
  Self-written macrosteps and microsteps

Kameleon can create virtual machine appliances in different formats for different Linux distributions

  Generates bash scripts to create an initial virtual image of a Linux distribution
  Populates the image with more Linux packages
  Populates the image with the content of a CDE-SP package


CDE-SP Vs Kameleon

[Bar chart: Kameleon vs. CDE-SP; y-axis in seconds, 0 to 1600]

Figure 1 : Overhead when using CDE with Kameleon VM appliance


Related Work

Research Objects: package scientific workflows with auxiliary information about the workflows, including provenance information and metadata such as the authors and the version

CDE and Sumatra can capture an execution environment in a lightweight fashion

SystemTap, being a kernel-based tracing mechanism, has better performance compared to ptrace but needs to run at a higher privilege level

Provenance-to-Use (PTU) and ReproZip include provenance in self-contained software packages


Conclusion

CDE does not encapsulate provenance of associated dependencies in a software package

The lack of information about the origins of dependencies in a software package creates issues when constructing software pipelines from packages

CDE-SP can include software provenance as part of a software package

CDE-SP can use software package provenance to build software pipelines

CDE-SP can maintain provenance when used to construct software pipelines


Acknowledgments

Neil Best at The University of Chicago

Joshua Elliott at Columbia University

Justin Wozniak at Argonne National Laboratory

Allison Brizius at RDCEP Center

NSF grants SES-0951576 and GEO-1343816
