Introduction to the Kepler Workflow System
Matthew B. Jones
National Center for Ecological Analysis and Synthesis (NCEAS)
University of California, Santa Barbara
Software Tools for Sensor Networks
A Workshop sponsored by NCEAS, LTER, and DataONE
May 1-5, 2012
Abstract
Scientific workflows capture the transformation of data that are produced and consumed by disparate analysis and modeling software systems. Kepler is an open-source system for authoring and executing workflows, providing access to data and services from a variety of networks and systems. By versioning data, workflows, and executions, Kepler allows full reconstruction of the analyses used in scientific papers, even if those analyses are conducted using a variety of commercial and custom software. Kepler promotes reproducible science by allowing users to publish these workflows, data products, and execution traces to remote repositories to be shared with other users.
Diverse Analysis and Modeling
• Wide variety of analyses used in ecology and environmental sciences
  – Statistical analyses and trends
  – Rule-based models
  – Dynamic models (e.g., continuous time)
  – Individual-based models (agent-based)
  – many others
• Implemented in many frameworks
  – implementations are black boxes
  – learning curves can be steep
  – difficult to couple models
Analysis/Modeling Challenges
• Manual process to work with multiple analytical systems
• Data are discovered outside of tools and imported manually
• Difficult to understand models at a glance
• Difficult to revise analyses except in scripted systems
• No accepted way to publish models to share with colleagues
• Little re-use of components – many re-inventions
• Difficult to use multiple computers for one analysis/model
  – Only a few experts use grid computing
Reproducible Science
• Analytical transparency
  – open systems
  – works across analysis packages
  – documents algorithms completely
• Automated analysis for repeatability
  – must be scriptable
  – must be able to handle data dynamically
• Archived and shared analysis and model runs
• Current analytical practices are difficult to manage
• Model the steps used by researchers during analysis
  – Graphical model of the flow of data among processing steps
• Each step often occurs in different software
  – Matlab, R, SAS, C/C++, Fortran, Swarm, ...
  – Each component can ‘wrap’ external systems, presenting a unified view
• Refer to these graphs as ‘Scientific Workflows’
Models as ‘scientific workflows’
[Diagram: Data → Clean → Graph/Analyze/Model; nodes A (Source, e.g., data), B, and C (Sink, e.g., display)]
Scientific workflows
• What are scientific workflows?
  – Graphical model of data flow among processing steps
  – Inputs and outputs of components are precisely defined
  – Components are modular and reusable
  – Flow of data controlled by a separate execution model
  – Support for hierarchical models
[Diagram: nodes A (Source, e.g., data), B (Processor, e.g., regression), and C (Sink, e.g., display)]
Outline
• Overview of Kepler
• Features
  – Data Access
  – Workflow archiving and sharing
  – Grid Computing support
• Open source community
Overview of Kepler
• Goals
  – Produce an open-source scientific workflow system
    • enable scientists to design, share, and execute scientific workflows
  – Support scientists in a variety of disciplines
    • e.g., biology, ecology, oceanography, astronomy
• Important features
  – access to scientific data
  – flexible framework that works across analytical packages
  – simplify distributed computing using computing grids
  – clear documentation of analyses and models
  – effective user interface for workflow design
  – provenance tracking for results
  – model archiving and sharing
Kepler use cases represent many science domains
• Ecology
  – SEEK: Ecological Niche Modeling
  – COMET: environmental science
  – REAP: Parasite invasions using sensor networks
• Geosciences
  – GEON: LiDAR data processing
  – GEON: Geological data integration
• Molecular biology
  – SDM: Gene promoter identification
  – ChIP-chip: genome-scale research
  – CAMERA: metagenomics
• Oceanography
  – REAP: SST data processing
  – LOOKING: ocean observing CI
  – NORIA: ocean observing CI
  – ROADNet: real-time data modeling
  – Ocean Life project
• Physics
  – CPES: Plasma fusion simulation
  – FermiLab: particle physics
• Phylogenetics
  – ATOL: Processing phylodata
  – CiPRES: phylogenetic tools
• Chemistry
  – Resurgence: Computational chemistry
  – DART: X-Ray crystallography
• Library Science
  – DIGARCH: Digital preservation
  – Cheshire digital library: archival
• Conservation Biology
  – SanParks: Thresholds of Potential Concerns
Anatomy of a Kepler Workflow
• Actors
• Ports
• Channels
• Tokens: int, string, record{..}, array[..], ...
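The actor/port/channel/token anatomy above can be sketched in a few lines of Python. This is an illustrative toy, not Kepler's actual API: actors exchange typed tokens over FIFO channels, and a trivial "director" fires them in order.

```python
from collections import deque

class Channel:
    """A FIFO queue connecting an output port to an input port."""
    def __init__(self):
        self.queue = deque()
    def put(self, token):
        self.queue.append(token)
    def get(self):
        return self.queue.popleft()

class Source:
    """Emits data tokens (here, plain ints) onto its output channel."""
    def __init__(self, out, data):
        self.out, self.data = out, data
    def fire(self):
        for token in self.data:
            self.out.put(token)

class Processor:
    """Consumes tokens, transforms them, and emits the results."""
    def __init__(self, inp, out):
        self.inp, self.out = inp, out
    def fire(self):
        while self.inp.queue:
            self.out.put(self.inp.get() * 2)  # stand-in transformation

class Sink:
    """Collects tokens for display or storage."""
    def __init__(self, inp):
        self.inp, self.received = inp, []
    def fire(self):
        while self.inp.queue:
            self.received.append(self.inp.get())

# Wire actors together via channels and fire them in dataflow order.
a_to_b, b_to_c = Channel(), Channel()
src = Source(a_to_b, [1, 2, 3])
proc = Processor(a_to_b, b_to_c)
sink = Sink(b_to_c)
for actor in (src, proc, sink):
    actor.fire()
print(sink.received)  # [2, 4, 6]
```

In Kepler proper, the separate execution model (the "director") decides when each actor fires, which is what lets the same workflow graph run under different models of computation.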
Kepler scientific workflow system
• Data source from repository
• R processing script:

  res <- lm(BARO ~ T_AIR)
  res
  plot(T_AIR, BARO)
  abline(res)

• Run Management
  – Each execution recorded
  – Provenance of derived data recorded
  – Can archive runs and derived data
A Simple Kepler Workflow
• Component Tab
• Workflow Run Manager
• Searchable Component List
• Component Documentation
Data preparation
• FORTRAN code
• MATLAB code
Data Access
Accessing Data in Kepler
• File system (e.g., CSV files)
• Catalog searches (e.g., KNB)
• Remote databases (e.g., PostgreSQL)
• Web services
• Data access protocols (e.g., OPeNDAP)
• Streaming data (e.g., DataTurbine)
• Specialized repositories (e.g., SRB)
• etc., and extensible
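The first access path above, plain files on the local file system, is the simplest case. A minimal sketch in Python (Kepler provides dedicated actors for this; the column names below are hypothetical sensor variables, not from a real dataset):

```python
import csv
import io

# Stand-in for a local CSV file with hypothetical sensor columns.
raw = io.StringIO("T_AIR,BARO\n21.5,1012.3\n22.1,1011.8\n")

# Parse each record and convert string fields to typed (float) values.
rows = [
    {key: float(value) for key, value in record.items()}
    for record in csv.DictReader(raw)
]
print(rows[0]["BARO"])  # 1012.3
```

The other access paths (catalogs, OPeNDAP, DataTurbine streams) follow the same pattern from the workflow's point of view: a source actor that emits typed data tokens downstream.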
Direct Data Access to Data Repositories
• Search for a metadata term (“ADCP”)
• 398 hits for ‘ADCP’ located in search
• Drag to workflow area to create a data source
OPeNDAP
• Directly access OPeNDAP servers
• Apply OPeNDAP constraints for remote data subsetting
• Current work: searchable catalogs across OPeNDAP servers
Gene sequences via web services
• Web service executes remotely (e.g., in Japan)
• Gene sequence returned in XML format
• Extracted sequence can be returned for further processing
• This entire workflow can be wrapped as a re-usable component so that the details of extracting sequence data are hidden unless needed.
Benthic Boundary Layer Project: Kilo Nalu, Hawaii
Benthic Boundary Layer Geochemistry and Physics at the Kilo Nalu Observatory
G. Pawlak, M. McManus, F. Sansone, E. De Carlo, A. Hebert and T. Stanton
NSF Award #OCE-0536607-000
• Research instruments are part of the cabled array at the Kilo Nalu Observatory
• Deployed off of Point Panic, Honolulu Harbor, Hawai’i
• Goal: Measure the interactions between physical oceanographic forcing, sediment alteration, and modification of sediment-seawater fluxes
Accessing sensor streams at Kilo Nalu
[Plot: water temperature (bottom, 10 m ADCP) versus time, 01:00–17:00, roughly 24.2–24.5 degrees C]
• Streaming data from the observatory via a DataTurbine server
• Support application scripts in R, Matlab, etc.:

  now <- Sys.time()
  Epoch <- now - as.numeric(now)
  timeval <- Epoch + timestamps
  posixtmedian = median(timeval)
  mediantime = as.numeric(posixtmedian)
  meantemp = mean(data)
• Modular components, easily saved and shared
• Graphs and derived data can be archived and displayed
Composite actors aid comprehension
Workflow archiving and sharing
• Save components for later re-use
• Share components via external repositories
Archiving isn’t just for data...
• Kepler can archive and version:
  – Analysis code and workflows
  – Results and derived data (e.g., data tables, graphs, maps)
  – Derived data lineage
    • What data were used as inputs
    • What processes were used to generate the derived products
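The lineage idea above, recording for each derived product which inputs and which process produced it, can be sketched as follows. This is an illustrative toy, not Kepler's provenance schema; the record format and hash-based identifiers are assumptions for the example.

```python
import hashlib
import json

def run_step(process_name, inputs, func):
    """Execute one processing step and return (result, provenance record).

    The record identifies inputs and output by short content hashes,
    so the derivation can be traced and checked later.
    """
    result = func(*inputs)
    record = {
        "process": process_name,
        "inputs": [hashlib.sha256(json.dumps(i).encode()).hexdigest()[:8]
                   for i in inputs],
        "output": hashlib.sha256(json.dumps(result).encode()).hexdigest()[:8],
    }
    return result, record

# Derive a mean temperature and capture its lineage in one call.
data = [24.2, 24.3, 24.5]
mean, prov = run_step("mean", [data], lambda xs: sum(xs) / len(xs))
print(prov["process"])  # mean
```

Kepler's provenance subsystem captures this kind of record automatically for every token that flows through a workflow run, which is what makes archived runs reconstructable.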
Run Management & Sharing
• Provenance subsystem monitors data tokens
Scheduling remote execution
Viewing remote runs
Grid Computing
• Support for several grid technologies
  – Ad-hoc Kepler networks (Master-Slave)
  – Globus grid jobs
  – Hadoop Map-Reduce
  – SSH-plumbed HPC
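The master-slave pattern listed above can be sketched with a local worker pool. This is a stand-in illustration, not Kepler's distributed execution code: the "master" splits a parameter sweep across workers, which in a real grid setting would be remote Kepler nodes or cluster jobs rather than threads.

```python
from concurrent.futures import ThreadPoolExecutor

def model_run(param):
    """Stand-in for one independent model execution (hypothetical)."""
    return param * param

# The master distributes independent runs to workers and gathers results.
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(model_run, range(5)))
print(results)  # [0, 1, 4, 9, 16]
```

The same fan-out/gather shape underlies the Hadoop Map-Reduce and Globus job options: the workflow stays the same graph, and the director/grid layer decides where each firing runs.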
Open Source Community
Open Kepler Collaboration
• http://kepler-project.org
• Open-source
  – BSD License
• Collaborators
  – UCSB, UCD, UCSD, UCB, Gonzaga, many others
Ptolemy II
Community Contribution: Kepler/WEKA
from Peter Reutemann
Community Contribution: Science Pipes
from Paul Allen, Cornell Lab of Ornithology
In summary…
• Typical analytical models are complex and difficult to comprehend and maintain
• Scientific workflows provide
  – An intuitive visual model
  – Structure and efficiency in modeling and analysis
  – Abstractions to help deal with complexity
  – Direct access to data
  – Means to publish and share models
• Kepler is an evolving but effective tool for scientists
  – The Kepler/CORE award funds the transition from research prototype to production software tool
Advantages of Scientific Workflows
• Mix analytical systems
  – Matlab, R, C code, FORTRAN, other executables, ...
• Understand models
  – visually depict how the analysis works
• Directly access data
• Utilize Grid and Cloud computing
• Share and version models
  – allow sharing of analytical procedures
  – document precise versions of data and models used
• Provide provenance information
  – provenance is critical to science
  – workflows are metadata about the scientific process
Workflows promote reproducible science
• Scientific Workflows are metadata about process
• Document data analyses and models
  – provide provenance for data derivation
  – allow sharing of analytical details
• Publishing and citing workflows supports reproducibility of scientific results
NCEAS’ model for Open Science
From Reichman, Jones, and Schildhauer; doi:10.1126/science.1197962
Questions?
• http://www.nceas.ucsb.edu/ecoinformatics/
• http://kepler-project.org
Acknowledgments
• This material is based upon work supported by:
  – The National Science Foundation (9980154, 9904777, 0131178, 9905838, 0129792, and 0225676)
  – The National Center for Ecological Analysis and Synthesis
  – The Andrew W. Mellon Foundation
• Kepler contributors: SEEK, REAP, Kepler/CORE, Ptolemy II, SDM/SciDAC projects
• For many shared conversations and a shared vision for Kepler:
  – Bertram Ludaescher and Tim McPhillips, UC Davis
  – Ilkay Altintas, UC San Diego
  – Mark Schildhauer, UC Santa Barbara
  – Shawn Bowers, Gonzaga University
  – Christopher Brooks, UC Berkeley
Extra slides
Sensor Network Management
Real-time Environment for Analytical Processing
• Management and Analysis of Environmental Observatory Data using the Kepler Scientific Workflow System
http://reap.ecoinformatics.org/
REAP goals
• For scientists
  – capabilities for designing and executing complex analytical models over near real-time and archived data sources
• For data-grid engineers
  – monitoring and management capabilities for underlying sensor networks
• For outside users
  – access to observatory data and model results, approachable to non-scientists
Sensor sites: topology and monitoring