Ipaw14 presentation Quan, Tanu, Ian
Auditing and Maintaining Provenance in Software Packages
Quan Pham (1), Tanu Malik (2), Ian Foster (1,2)
(1) Department of Computer Science and (2) Computation Institute, The University of Chicago,
Chicago, IL 60637, [email protected], [email protected]
Presented by Boris Glavic
Illinois Institute of Technology
IPAW14
June, 10th, 2014
Provenance in Software Packages June, 10th, 2014 1 / 29
Outline
1 Introduction
2 Software Pipeline Usecase
3 CDE-SP: Software Provenance in CDE
4 Experiment and Evaluation
5 Related Work
6 Conclusion
Current Solutions for Ensuring Reproducibility and Issues
1 Publish source code and data
    Examples: GitHub, Figshare, Research Compendia
    Pros: (in many cases) easy to accomplish
    Cons: need to recompile and re-execute
2 Publish a software package including source code, data, and environment dependencies
    Examples: CDE, RunMyCode.org
    Pros: re-execute without installation
    Cons: not easy to combine and merge shared packages
3 Publish a virtual machine image (VMI) that includes OS, source code, data, and environment
    Examples: Cloud BioLinux (NEBC), Swift Appliance (RDCEP)
    Pros: no additional modules or components needed to rerun
    Problem: too hard to provision and understand
Reproducibility Problem
Our philosophy:
"... releasing shoddy VMs is easy to do, but it doesn’t help you learn how to do a better job of reproducibility along the way. Releasing software pipelines, however crappy, is on the path towards better reproducibility."
C. Titus Brown [1]
Reproducibility problem: How can we make it easy to combine and merge shared packages, while correctly attributing authorship of software packages?
No need to provision VMIs or to publish merely source code and data.
[1] http://ivory.idyll.org/blog/vms-considered-harmful.html
Problem Scope
Use CDE [2] to capture and create portable software packages
Extend, partially re-use, and combine CDE packages to create new reproducible software pipelines
Attribute authorship of software packages in new software pipelines
CDE has an OVERLAP conflict!
[2] Guo, P.J., Engler, D.: CDE: using system call interposition to automatically create portable software packages. USENIX Association, Portland, OR (2011)
CDE
Create a portable software package without installation, configuration, or privileged permissions
Audit mode to create a CDE package
Software Pipelines Contain CDE packages
A software pipeline consists of many individual software modules
A software module depends on externally-developed libraries
A software module is often packaged together with specific versions of libraries
RDCEP Usecase
Alice, Bob, and Charlie are scientists at the Center for Robust Decision Making on Climate and Energy Policy (RDCEP)
Alice (A) develops data integration methods to produce higher-resolution datasets depicting inferred land use over time.
Bob (B) develops computational models to do model-based comparative analysis. B’s software environment includes A’s software modules to produce high-resolution datasets.
Charlie (C) uses A’s and B’s software modules within data-intensive computing methods to run them in parallel.
The Center wants to predict future yields of staple agricultural commodities given changes in the climate.
[Figure: C's Package (merged from B's), B's Package (from A's), and A's Package. Pipeline steps shown: Retrieve data, Aggregation, Generate images, Model-based analysis, Parallel init, Parallel summary.]
A’s Experiment & Package
A’s package
cde-root
    path to A’s files
        a-experiment.sh
        retrieve-data
        aggregation
        generate-image
        f1, f2, a-output
    path to common libs
        libc.so
Re-execute A’s experiment:
cde-exec a-experiment.sh
cat a-experiment.sh
./retrieve-data f1
./aggregation f1 f2
./generate-image f2 a-output
B’s Experiment & Package
B’s package
cde-root
    path to A’s files
        [...]
    path to B’s files
        b-experiment.sh
        analysis
        b-output
    path to common libs
        libc.so
Re-execute B’s experiment:
cde-exec b-experiment.sh
cat b-experiment.sh
cd path to A’s experiment
cde-exec a-experiment.sh
cd path to B’s files
./analysis path to A’s files/a-output b-output
C’s Experiment & Package
C’s package
cde-root
    path to A’s files
        [...]
    path to B’s files
        [...]
    path to C’s files
        c-experiment.sh
        parallel-init
        parallel-summary
        c-output
    path to common libs
        libc.so
Re-execute C’s experiment:
cde-exec c-experiment.sh
cat c-experiment.sh
parallel-init path to A’s files/f4
cd path to A’s files
cde-exec ./aggregation f4 f5
cde-exec ./generate-image f5 f6
cd path to B’s files
cde-exec ./analysis path to A’s files/f6 f7
cd path to C’s files
./parallel-summary path to B’s files/f7 c-output
Dependency Overlap in Multiple cde-root Directories
File Overlap of Different Linux Distributions
        RH            SUSE          U12           U13
Amz     5498 / 23k    3184 / 11k    1203 / 5.4k   1819 / 5.5k
RH                    3861 / 12k    1654 / 6.6k   2223 / 6.3k
SUSE                                1245 / 3.9k   2085 / 6.4k
U12                                               8226 / 24k
Table 1 : Ratio of different files having the same path in 5 popular AMIs. The denominator is the number of files having the same path in two distributions, and the numerator is the number of files with the same path but a different md5 checksum. Omitted are manual pages in the /usr/share/ directory.
Amz     Amazon Linux AMI
RH      Red Hat Enterprise Linux 6.4
SUSE    SUSE Linux Enterprise Server 11
U12     Ubuntu Server 12.04.3 LTS
U13     Ubuntu Server 13.10
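A count like the ones in Table 1 could be computed roughly as follows. This is a minimal sketch, not the authors' tooling: it assumes the two distributions are available as unpacked directory trees, the names md5_of, regular_files, and path_overlap are illustrative, and the skipped prefix usr/share/man is an assumption (the slide only says that manual pages under /usr/share/ were omitted).

```python
import hashlib
import os


def md5_of(path):
    """Return the hex md5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def regular_files(root, skip_prefixes=()):
    """Map each relative path of a regular file under root to its full path."""
    found = {}
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root)
            if rel.startswith(tuple(skip_prefixes)):
                continue  # e.g. manual pages (prefix is an assumption)
            if os.path.isfile(full) and not os.path.islink(full):
                found[rel] = full
    return found


def path_overlap(root_a, root_b, skip_prefixes=("usr/share/man",)):
    """Return (differing, shared) as in Table 1.

    shared    = number of paths present in both trees (the denominator)
    differing = shared paths whose md5 checksums differ (the numerator)
    """
    files_a = regular_files(root_a, skip_prefixes)
    files_b = regular_files(root_b, skip_prefixes)
    shared = sorted(set(files_a) & set(files_b))
    differing = [p for p in shared if md5_of(files_a[p]) != md5_of(files_b[p])]
    return len(differing), len(shared)
```

Calling path_overlap on two unpacked images returns one numerator / denominator pair of the table; symbolic links are skipped here, which is a design choice of this sketch rather than something stated in the talk.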
Re-direction in Multiple cde-root Directories
CDE-SP
CDE-SP: Enhanced CDE that includes software provenance
Describe tools and methods to audit, store, and query provenance
Provenance queries
    Determine the environment under which a dependency was built
    Examine the dependencies which must be present
    Answer whether packages in a pipeline can satisfy a new package
    Attribute authorship of software packages in a pipeline
Combine and validate authorship from stored provenance
CDE-SP Audit
Objectives
Capture additional details of the origins of a library or a binary
Use these details for compiling and creating software pipelines
Methods
Create a dependency tree
    Process system calls are monitored
    Whenever a process executes a file system call, a dependency of that process is recorded
    A dependency can be a data file or a shared library
Extract information about binaries and required shared libraries
    file, ldd, strings, and objdump UNIX commands
    uname -a and function getpwuid(getuid())
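The extraction step above can be sketched as follows. This is only an illustration under assumptions: the helper names run_tool, current_user, and describe_binary are hypothetical, only file and ldd are invoked (strings and objdump would be called the same way), and Python's platform.uname() stands in for uname -a. It is not the CDE-SP implementation.

```python
import os
import platform
import pwd
import shutil
import subprocess


def run_tool(tool, *args):
    """Run a command-line tool and return its stdout, or None if not installed."""
    if shutil.which(tool) is None:
        return None
    result = subprocess.run([tool, *args], capture_output=True, text=True)
    return result.stdout.strip()


def current_user():
    """User name via getpwuid(getuid()); falls back to the numeric uid."""
    try:
        return pwd.getpwuid(os.getuid()).pw_name
    except KeyError:
        return str(os.getuid())


def describe_binary(path):
    """Collect origin details for one binary or shared library."""
    uname = platform.uname()  # Python counterpart of `uname -a`
    return {
        "path": os.path.abspath(path),
        "file": run_tool("file", "-b", path),  # file type and architecture
        "ldd": run_tool("ldd", path),          # required shared libraries
        "machine": f"{uname.system} {uname.release} {uname.machine}",
        "agent": current_user(),
    }
```

For example, describe_binary("/bin/sh") returns a dictionary whose "agent" and "machine" entries correspond to the meta.agent and meta.machine records stored with the package.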
Storage
Store provenance within the package itself
Use LevelDB: a fast and light-weight key-value storage library
Encode in the key the UNIX process identifier along with spawn time
Key                          Value         Explanation
pid.PID1.exec.TIME           PID2          PID1 wasTriggeredBy PID2
pid.PID.[path, pwd, args]    VALUES        Other properties of PID
io.PID.action.IO.TIME        FILE(PATH)    PID wasGeneratedBy / wasUsedBy FILE(PATH)
meta.agent                   USERNAME      User information
meta.machine                 OSNAME        Operating system distribution
Table 2 : LevelDB key-value pairs that store file and process provenance. Capital letter words are arguments.
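The key layout of Table 2 can be illustrated with a short sketch. A plain Python dict stands in for LevelDB so that the example runs anywhere; with a real LevelDB binding the dict writes would become put calls. The function names, the use of integer timestamps for TIME, and the IO values "read" and "write" are assumptions made for illustration only.

```python
def record_process(store, pid, parent_pid, spawn_time, path, pwd, args):
    """Store one process: who triggered it, plus its path, pwd, and args."""
    # pid.PID1.exec.TIME -> PID2 (PID1 wasTriggeredBy PID2)
    store[f"pid.{pid}.exec.{spawn_time}"] = str(parent_pid)
    # pid.PID.[path, pwd, args] -> VALUES
    store[f"pid.{pid}.path"] = path
    store[f"pid.{pid}.pwd"] = pwd
    store[f"pid.{pid}.args"] = args


def record_io(store, pid, io, time, file_path):
    """Store one file access: io.PID.action.IO.TIME -> FILE(PATH)."""
    store[f"io.{pid}.action.{io}.{time}"] = file_path


def record_meta(store, username, osname):
    """Store user and operating system information."""
    store["meta.agent"] = username
    store["meta.machine"] = osname


def scan(store, prefix):
    """Return (key, value) pairs with the given key prefix, in key order.

    LevelDB keeps keys sorted, so a prefix scan is cheap there; sorting the
    dict keys imitates that behaviour.
    """
    return [(k, store[k]) for k in sorted(store) if k.startswith(prefix)]
```

Because every record of a process shares the prefix pid.PID. or io.PID., all provenance of one process can be retrieved with a single prefix scan, which is one plausible reason for encoding the process identifier in the key.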
Query
LevelDB provides a minimal API for querying
Simple, light-weight query interface
Input: a program whose dependencies need to be retrieved
Output: a GraphViz file displaying file and process dependencies
Use a depth-first search algorithm to create a dependency tree with the input program as its root
Exclusion option to remove uninteresting dependencies: /lib/, /usr/lib/, /usr/share/, /etc/
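The query described above can be sketched as a depth-first search that emits GraphViz DOT text. This is a minimal sketch under assumptions: the dependency relation is passed in as a dict mapping each node to its direct dependencies (in CDE-SP it would be read from the LevelDB store), and the names dependency_tree and to_dot are illustrative.

```python
EXCLUDED_PREFIXES = ("/lib/", "/usr/lib/", "/usr/share/", "/etc/")


def dependency_tree(deps, root, excluded=EXCLUDED_PREFIXES):
    """Depth-first search from root; returns tree edges in visit order.

    deps maps a node (program or file path) to a list of direct dependencies.
    Dependencies under an excluded prefix are dropped, and each node is
    attached to the tree only once.
    """
    edges = []
    visited = {root}

    def visit(node):
        for child in deps.get(node, []):
            if child.startswith(tuple(excluded)) or child in visited:
                continue
            visited.add(child)
            edges.append((node, child))
            visit(child)

    visit(root)
    return edges


def to_dot(edges):
    """Render tree edges as a GraphViz digraph."""
    lines = ["digraph dependencies {"]
    for parent, child in edges:
        lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)
```

Writing the result of to_dot(dependency_tree(deps, "a-experiment.sh")) to a file gives input that the GraphViz dot tool can render. The recursive form is fine for shallow trees; a very deep dependency chain would call for an explicit stack instead.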
Authorship of Software Modules
Combine authorship of the contributing packages
Validate authorship from the provenance stored in the original package
    Generate the subgraph associated with the part of the new package
    Use subgraph isomorphism (NP-hard) to validate with the original provenance graph
    Match provenance nodes of processes with the same paths of their binaries and working directories
    Match provenance nodes of files with the same path
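The matching rules above suggest a simplification that can be sketched in a few lines. Note the assumption: if every process is identified by (binary path, working directory) and every file by its path, and those labels are unique within a graph, then validating the subgraph reduces to checking that each of its labeled edges also occurs in the original provenance graph. This is not a general subgraph isomorphism solver and may differ from what CDE-SP does; all names below are hypothetical.

```python
def process_node(binary_path, working_dir):
    """Label a process node by the path of its binary and its working directory."""
    return ("process", binary_path, working_dir)


def file_node(path):
    """Label a file node by its path."""
    return ("file", path)


def missing_edges(sub_edges, original_edges):
    """Return the edges of the subgraph that do not occur in the original graph.

    Each edge is a (source_label, target_label) pair. An empty result means
    the subgraph is consistent with the original provenance.
    """
    original = set(original_edges)
    return [edge for edge in sub_edges if edge not in original]


def is_valid_subgraph(sub_edges, original_edges):
    """True if every labeled edge of the subgraph appears in the original."""
    return not missing_edges(sub_edges, original_edges)
```

When labels are not unique (for example, the same binary run twice in the same directory), this shortcut no longer applies and a real subgraph isomorphism search is needed, which is where the NP-hardness noted on the slide comes in.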
Experiments
Performance of CDE-SP
    Auditing performance overhead
    Disk storage increase
    Provenance query runtime
Redirection overhead when multiple UUID-based directories are created
Compare the lightweight virtualization approach of CDE-SP with Kameleon [3], a heavyweight virtualization approach used for reproducibility
Experiments were run on an Ubuntu 12.04 LTS workstation with 8 GB of RAM and an 8-core Intel(R) processor clocked at 1600 MHz.
[3] Emeras, J., Richard, O., Bzeznik, B.: Reconstructing the software environment of an experiment with Kameleon (2011)
Performance & Size Overhead
Pipeline with two applications: Aggregation and Generate Image
2.1% slowdown of CDE-SP vs. 0-30% CDE virtualization overhead [4]
LevelDB database size 236kB (0.03% package size increase), containing approximately 12,000 key-value pairs
          Create Package    Execution     Disk Usage     Provenance Query
          (seconds)         (seconds)                    (seconds)
CDE       852.6±2.4         568.8±2.4     732MB
CDE-SP    870.5±2.5         569.5±1.8     732MB+236kB    0.4±0.03
Table 3 : The performance overhead of CDE-SP is negligible in comparison with CDE
[4] Guo, P.J., Engler, D.: CDE: using system call interposition to automatically create portable software packages. USENIX Association, Portland, OR (2011)
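The percentages on this slide can be checked against Table 3 with a few lines of arithmetic. The mapping is an assumption of this sketch: the 2.1% slowdown is read as the increase in package creation time, and 1 MB is taken as 1000 kB.

```python
def relative_increase(baseline, measured):
    """Percentage increase of measured over baseline."""
    return (measured - baseline) / baseline * 100.0


# Package creation (audit) time: CDE 852.6 s vs. CDE-SP 870.5 s
audit_slowdown = relative_increase(852.6, 870.5)

# Execution time: CDE 568.8 s vs. CDE-SP 569.5 s
exec_slowdown = relative_increase(568.8, 569.5)

# Provenance database: 236 kB on top of a 732 MB package (1 MB = 1000 kB assumed)
size_increase = 236 / (732 * 1000) * 100.0

print(f"audit slowdown:     {audit_slowdown:.1f}%")
print(f"execution slowdown: {exec_slowdown:.2f}%")
print(f"size increase:      {size_increase:.2f}%")
```

Under these assumptions the first and last values round to the 2.1% and 0.03% quoted on the slide.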
Redirection Overhead in CDE-SP
Pipelined output of Aggregation to input of Generate Image
3 output files of the Aggregation package were moved to the Generate Image package
2 cross-package execve() system calls
Less than a 1% slowdown of CDE-SP
Kameleon
Use the Kameleon engine to make a bare-bones VM appliance
    Self-written YAML-formatted recipes
    Self-written macrosteps and microsteps
Kameleon can create virtual machine appliances in different formats for different Linux distributions
    Generates bash scripts to create an initial virtual image of a Linux distribution
    Populates the image with more Linux packages
    Populates the image with the content of a CDE-SP package
CDE-SP Vs Kameleon
[Bar chart: time in seconds (y-axis, 0 to 1600) for Kameleon and CDE-SP]
Figure 1 : Overhead when using CDE with Kameleon VM appliance
Related Work
Research Objects: package scientific workflows with auxiliary information about the workflows, including provenance information and metadata such as the authors and the version
CDE and Sumatra can capture an execution environment in a lightweight fashion
SystemTap, being a kernel-based tracing mechanism, has better performance compared to ptrace but needs to run at a higher privilege level
Provenance-to-Use (PTU) and ReproZip include provenance in self-contained software packages
Conclusion
CDE does not encapsulate provenance of associated dependencies in a software package
The lack of information about the origins of dependencies in a software package creates issues when constructing software pipelines from packages
CDE-SP can include software provenance as part of a software package
CDE-SP can use software package provenance to build software pipelines
CDE-SP can maintain provenance when used to construct software pipelines