USC Viterbi School of Engineering Scientific Workflows and Systems Ewa Deelman.
USC Viterbi School of Engineering
Scientific Workflows and Systems
Ewa Deelman
Outline
• Scientific workflows
• Business workflows
• Different workflow systems
– Taverna
– Kepler
– Triana
– Askalon
Applications today
• Complex
– Involve many computational steps
– Require many (possibly diverse) resources
• Composed of individual application components
– Components written by different individuals
– Components require and generate large amounts of data
– Components written in different languages
• Reuse of individual intermediate data products
• Need to keep track of how the data was produced
Workflow Instance
Ewa Deelman, www.isi.edu/~deelman, pegasus.isi.edu
[Figure: workflow instance. For each of Image 1, Image 2, …, Image n: Collect image, then AdjustColor; followed by Co-Add image and Visualize.]
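The workflow instance on this slide can be sketched as a small dependency graph. This is only an illustration: the task names and the exact edges (each AdjustColor depends on its own Collect image, Co-Add depends on every AdjustColor, Visualize depends on Co-Add) are assumptions read off the figure labels, and the standard-library `graphlib` module stands in for a real workflow engine.

```python
from graphlib import TopologicalSorter


def build_workflow(n_images):
    """Return {task: set of prerequisite tasks} for n_images input images.

    Task names are hypothetical; the slide only shows the box labels.
    """
    deps = {}
    adjusted = []
    for i in range(1, n_images + 1):
        collect = f"collect_image_{i}"
        adjust = f"adjust_color_{i}"
        deps[collect] = set()        # no prerequisites
        deps[adjust] = {collect}     # colour adjustment needs the collected image
        adjusted.append(adjust)
    deps["co_add_image"] = set(adjusted)   # waits for every adjusted image
    deps["visualize"] = {"co_add_image"}
    return deps


workflow = build_workflow(3)
# One valid execution order that respects every dependency.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```

The per-image branches have no edges between them, so an engine is free to run them in parallel; only the Co-Add step forces a synchronisation point.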
Business Workflows
• Designed to compose applications based on web services
• BPEL – Standard language for service interactions
– Has many constructs to deal with the invocation of web services, including fault handling, and support for conditional logic.
BPEL constructs
• <receive>: Blocks until a matching message is received. This is typically used to receive a message from the client or a callback from a partner web service.
• <reply>: Send a message in response to a message received via a <receive>
• <invoke>: Perform an invocation on a web service. (one-way or request-response)
• <assign>: Assign a value to a variable.
• <sequence>: Executes a list of activities sequentially in lexical order.
• <flow>: Executes the activities in parallel.
• <while>: Repeats an activity for as long as a condition holds.
• <switch>: Selects one branch for execution from a set of branches based on a value.
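The control constructs listed above can be mimicked with a few Python combinators. This is a rough analogy, not BPEL and not any engine's API: every function name here is hypothetical, activities are plain callables that read and write a shared dictionary of variables, and `flow` uses a thread pool as a stand-in for parallel execution.

```python
from concurrent.futures import ThreadPoolExecutor


def assign(name, value):
    """<assign>: store a constant, or the result of value(variables)."""
    def run(variables):
        variables[name] = value(variables) if callable(value) else value
    return run


def sequence(*activities):
    """<sequence>: run the activities one after another, in listed order."""
    def run(variables):
        for activity in activities:
            activity(variables)
    return run


def flow(*activities):
    """<flow>: run the activities concurrently and wait for all of them."""
    def run(variables):
        with ThreadPoolExecutor(max_workers=len(activities)) as pool:
            futures = [pool.submit(a, variables) for a in activities]
            for future in futures:
                future.result()  # re-raise any exception from a branch
    return run


def while_(condition, body):
    """<while>: repeat body for as long as condition(variables) is true."""
    def run(variables):
        while condition(variables):
            body(variables)
    return run


def switch(cases, otherwise=None):
    """<switch>: run the first branch whose condition is true."""
    def run(variables):
        for condition, activity in cases:
            if condition(variables):
                activity(variables)
                return
        if otherwise is not None:
            otherwise(variables)
    return run


process = sequence(
    assign("count", 0),
    flow(assign("a", 1), assign("b", 2)),
    while_(lambda v: v["count"] < 3,
           assign("count", lambda v: v["count"] + 1)),
    switch([(lambda v: v["a"] + v["b"] > 2, assign("branch", "large"))],
           otherwise=assign("branch", "small")),
)
variables = {}
process(variables)
print(variables)
```

<receive>, <reply> and <invoke> are left out because they need a message transport, which is beyond a short sketch.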
Many BPEL engines
• ActiveBPEL
• IBM BPEL4J
• Oracle BPEL Process Manager
• Microsoft Windows Workflow Foundation
• …
Scientific vs Business Workflows
Scientific workflows:
• Large amounts of data
• Varied granularity of computations
• Large number of computations
• Often standalone components
• Non-programmers need to be able to compose them
• Need to provide provenance info
• Performance is important
Business workflows:
• Deal with services across domains
• Do not deal with standalone application components
• Usually not very data intensive
– Data can be easily sent between services
• Important to agree on standard interfaces so that MS & IBM can work together
• Focus on functionality/interoperability rather than performance
Example of a business workflow
Example of Scientific Workflow
• Workflow Specification Components
– Standalone computations
– Designed by different individuals
[Figure: workflow over Image1, Image2, Image3 with tasks Project (×3), Diff (×2), Fitplane (×2), BgModel, Background (×3), Add]
Different workflow systems
Taverna, a workbench for bioinformatics workflows
Slides courtesy of Katy Wolstencroft
The Community Problems
• Everything is Distributed
– Data, Resources and Scientists
• Heterogeneous data
• Very few standards
– I/O formats, data representation, annotation
– Everything is a string!
Integration of data and interoperability of resources is difficult
Lots of Resources
NAR 2007 – 968 databases
Traditional Bioinformatics
12181 acatttctac caacagtgga tgaggttgtt ggtctatgtt ctcaccaaat ttggtgttgt
12241 cagtctttta aattttaacc tttagagaag agtcatacag tcaatagcct tttttagctt
12301 gaccatccta atagatacac agtggtgtct cactgtgatt ttaatttgca ttttcctgct
12361 gactaattat gttgagcttg ttaccattta gacaacttca ttagagaagt gtctaatatt
12421 taggtgactt gcctgttttt ttttaattgg gatcttaatt tttttaaatt attgatttgt
12481 aggagctatt tatatattct ggatacaagt tctttatcag atacacagtt tgtgactatt
12541 ttcttataag tctgtggttt ttatattaat gtttttattg atgactgttt tttacaattg
12601 tggttaagta tacatgacat aaaacggatt atcttaacca ttttaaaatg taaaattcga
12661 tggcattaag tacatccaca atattgtgca actatcacca ctatcatact ccaaaagggc
12721 atccaatacc cattaagctg tcactcccca atctcccatt ttcccacccc tgacaatcaa
12781 taacccattt tctgtctcta tggatttgcc tgttctggat attcatatta atagaatcaa
Cutting and Pasting
• Advantages:
– Low Technology on both server and client side
– Very Robust: Hard to break.
– Data Integration happens along the way
• Disadvantages:
– Time Consuming (and painful!)
  • Can be repeated rarely
  • Limited to small data sets.
– Error Prone:
  • Poor repeatability
Pipeline Programming
• Advantages
– Repeatable
– Allows automation
– Quick, reliable, efficient
• Disadvantages
– Requires programming skills
– Difficult to modify
– Requires local tool and database installation
– Requires tool and database maintenance!!!
What we want as a solution
A system that:
• Allows automation
• Allows easy repetition, verification and sharing of experiments
• Works on distributed resources
• Requires few programming skills
• Runs on a local desktop / laptop
myGrid as a solution
myGrid allows the automated orchestration of in silico experiments over distributed resources from the scientist’s desktop
Built on computer science technologies of:
• Web services
• Workflows
• Semantic web technologies
Workflows
– General technique for describing and enacting a process
– Describes what you want to do, not how you want to do it
– High level description of the experiment
[Figure: RepeatMasker Web Service, GenScan Web Service, Blast Web Service]
Workflows
Workflow language specifies how bioinformatics processes fit together.
High level workflow diagram separated from any lower level coding – you don’t have to be a coder to build workflows.
Workflow is a kind of script or protocol that you configure when you run it.
Easier to explain, share, relocate, reuse and repurpose.
Workflow <=> Model
Workflow is the integrator of knowledge
The METHODS section of a scientific publication
Workflow Advantages
• Automation
– Capturing processes in an explicit manner
– Tedium! Computers don’t get bored/distracted/hungry/impatient!
– Saves repeated time and effort
• Modification, maintenance, substitution and personalisation
• Easy to share, explain, relocate, reuse and build
• Releases Scientists/Bioinformaticians to do other work
• Record
– Provenance: what the data is like, where it came from, its quality
Taverna Workflow Components
• Scufl: Simple Conceptual Unified Flow Language
• Taverna: Writing, running workflows & examining results
• SOAPLAB: Makes applications available
[Figure: SOAPLAB Web Service (any application); Web Service, e.g. DDBJ BLAST]
An Open World
• Open domain services and resources.
• Taverna accesses 3000+ services
• Third party – we don’t own them – we didn’t build them
• All the major providers
– NCBI, DDBJ, EBI …
• Enforce NO common data model.
• Quality Web Services considered desirable
Adding your own web services
• SoapLab
– http://www.ebi.ac.uk/soaplab/
• Java API Consumer
– import Java API of libSBML as workflow components
Shield the Scientist – Bury the Complexity
[Figure: Taverna Workbench and Application on top of the Scufl Model (Simple Conceptual Unified Flow Language); Workflow Execution through the workflow enactor and its processors: Plain Web Service, Soaplab, Local Java App, Enactor, BioMOBY, WSRF, BioMART, Styx client, R package, ...]
Kepler
Slides courtesy of Bertram Ludaescher
Scientific Workflow
Capture how a scientist works with data and analytical tools
– data access, transformation, analysis, visualization
– possible worldview: dataflow-oriented (cf. signal-processing)
Scientific workflow (wf) benefits (compare w/ script-based approaches):
– wf automation
– wf & component reuse
– wf design, documentation
– wf archival, sharing
– built-in concurrency (task-, pipeline-parallelism)
– built-in provenance support
– distributed execution (Grid) support
– …
Ex: SEEK Ecological Niche Modeling Pipeline
• Scientific Workflow paradigm:
– Reusable components (“actors”): a scientist’s verbs/actions
– Top-level workflows ≈ conceptual representation of the science process, sentences in the scientist’s language
– Sub-workflows ≈ increasing levels of detail
• Separation of concerns:
– actors: what to do
– parameters: configurable behavior
– channels: dataflow, pipeline composition
– directors: fix execution model, scheduling
– semantic types: smart discovery, linking
D Pennington, D Higgins, AT Peterson, M Jones, B Ludaescher, S Bowers. Ecological Niche Modeling using the Kepler Workflow System. Workflows for e-Science, Springer.
Simple Kepler workflow using R (a statistics package)
Data source from EcoGrid (metadata-driven ingestion)
R processing script:
res <- lm(BARO ~ T_AIR)
res
plot(T_AIR, BARO)
abline(res)
Plumbing with Style … (Norbert Podhorszki UC Davis, Scott Klasky ORNL)
[Figure: workflow stages Monitor, Transfer, Convert, Archive]
• Plasma physics simulation on 2048 processors on Seaborg@NERSC (LBL)
– Gyrokinetic Toroidal Code (GTC) to study energy transport in fusion devices (plasma microturbulence)
– Generating 800GB of data (3000 files, 6000 timesteps, 267MB/timestep), 30+ hour simulation run
• Under workflow control:
– Monitor (watch) simulation progress (via remote scripts)
– Transfer from NERSC to ORNL concurrently with the simulation run
– Convert each file to an HDF5 file
– Archive files in 4GB chunks into HPSS
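The archive step can be illustrated with a greedy packing routine that groups converted files into chunks of at most 4GB. This is a sketch under stated assumptions: the file names are invented, every file is assumed to be 267MB (the slide quotes that figure per timestep), and the actual workflow's packing strategy is not described in the slides.

```python
GB = 1024 ** 3
MB = 1024 ** 2


def pack_into_chunks(files, chunk_limit=4 * GB):
    """Group (name, size) pairs, in order, into chunks no larger than chunk_limit.

    A file larger than chunk_limit ends up alone in its own chunk.
    """
    chunks, current, current_size = [], [], 0
    for name, size in files:
        # Close the current chunk if adding this file would overflow it.
        if current and current_size + size > chunk_limit:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


# Hypothetical file list: 3000 HDF5 files of 267MB each.
files = [(f"gtc_{i:04d}.h5", 267 * MB) for i in range(3000)]
chunks = pack_into_chunks(files)
print(len(chunks), "chunks,", len(chunks[0]), "files in the first chunk")
```

Because files are packed in arrival order, the routine can run while the transfer and convert stages are still producing output.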
Our Starting Point: Actor-Oriented Modeling
Ports
– each actor has a set of input and output ports
– denote the actor’s signature
– produce/consume data (a.k.a. tokens)
– parameters are special “static” ports
Actor-Oriented Modeling
Dataflow Connections
– unidirectional actor “communication” channels
– connect output ports with input ports
– for composing analysis pipelines
Actor-Oriented Modeling
Sub-workflows / Composite Actors
– composite actors “wrap” sub-workflows
– like actors, have signatures (i/o ports of sub-workflow)
– hierarchical workflows (arbitrary nesting levels)
Actor-Oriented Modeling
Directors
– define the execution semantics of workflow graphs
– execute the workflow graph (according to some schedule)
– sub-workflows may have different directors
– promotes reusability
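The actor-oriented vocabulary (actors, ports, tokens, connections, directors) can be made concrete with a toy model. This is not Kepler's or Ptolemy's API: the class names `Actor` and `SimpleDirector` and the helper `connect` are hypothetical, and the director shown simply fires any actor that has a token waiting on every input port.

```python
from collections import deque


class Actor:
    """A component whose signature is its named input and output ports."""

    def __init__(self, name, fn, inputs, outputs):
        self.name = name
        self.fn = fn  # maps input tokens (as keyword args) to {output_port: token}
        self.inputs = {port: deque() for port in inputs}   # queued tokens per input port
        self.outputs = {port: [] for port in outputs}      # connected (actor, port) pairs

    def ready(self):
        # Fireable when every input port holds at least one token.
        return all(queue for queue in self.inputs.values())

    def fire(self):
        args = {port: queue.popleft() for port, queue in self.inputs.items()}
        for port, token in self.fn(**args).items():
            for target, target_port in self.outputs[port]:
                target.inputs[target_port].append(token)


def connect(src, out_port, dst, in_port):
    """Unidirectional channel from an output port to an input port."""
    src.outputs[out_port].append((dst, in_port))


class SimpleDirector:
    """Decides when actors fire; here: keep firing whatever is ready."""

    def __init__(self, actors):
        self.actors = actors

    def run(self):
        fired = True
        while fired:
            fired = False
            for actor in self.actors:
                if actor.inputs and actor.ready():
                    actor.fire()
                    fired = True


collected = []
scale = Actor("scale", lambda x: {"y": 2 * x}, inputs=["x"], outputs=["y"])
offset = Actor("offset", lambda x: {"y": x + 1}, inputs=["x"], outputs=["y"])
sink = Actor("sink", lambda x: collected.append(x) or {}, inputs=["x"], outputs=[])
connect(scale, "y", offset, "x")
connect(offset, "y", sink, "x")
for token in [1, 2, 3]:
    scale.inputs["x"].append(token)   # seed the pipeline with input tokens
SimpleDirector([scale, offset, sink]).run()
print(collected)
```

Swapping `SimpleDirector` for a different scheduling policy would change how the graph executes without touching the actors, which is the separation the slides attribute to directors.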
Models of Computation (A Wf Engineer’s Issue)
Directors separate the concerns of orchestration and scheduling from conceptual design
– Synchronous Dataflow (SDF)
  • Statically analyzable: schedule, no deadlocks, fixed buffer requirements; executable as a single thread by the director.
– Process Networks (PN)
  • Generalizes SDF. Actors execute as separate threads/processes, with queues of unbounded size (Kahn/MacQueen networks).
– Directed Acyclic Graph (DAG)
  • Special case of SDF. No loops, no pipelining.
– Continuous Time (CT)
  • Connections represent the value of a continuous time signal at some point in time ... Often used to model physical processes.
– Discrete Event (DE)
  • Actors communicate through a queue of events in time. Used for instantaneous reactions in physical systems.
– …
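The claim that SDF is statically analyzable can be illustrated with the balance equations: for every channel, (firings of the producer) × (tokens produced per firing) must equal (firings of the consumer) × (tokens consumed per firing). The sketch below solves these for the smallest integer firing counts. The function name and the edge format are assumptions made for this example, not part of Kepler.

```python
from fractions import Fraction
from math import lcm


def repetition_vector(actors, edges):
    """Smallest positive firing counts that balance every channel.

    edges: list of (src, tokens_produced, dst, tokens_consumed).
    Raises ValueError if the rates are inconsistent or the graph is disconnected.
    """
    rate = {actors[0]: Fraction(1)}
    pending = list(edges)
    while pending:
        progressed = False
        for edge in list(pending):
            src, produced, dst, consumed = edge
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * produced / consumed
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * consumed / produced
            elif src in rate and dst in rate:
                if rate[src] * produced != rate[dst] * consumed:
                    raise ValueError("inconsistent rates: no bounded-buffer schedule")
            else:
                continue  # neither end known yet; retry on a later pass
            pending.remove(edge)
            progressed = True
        if not progressed:
            raise ValueError("graph is not connected")
    # Scale the fractional rates to the smallest integers.
    scale = lcm(*(r.denominator for r in rate.values()))
    return {actor: int(r * scale) for actor, r in rate.items()}


# A produces 2 tokens per firing, B consumes 3; B produces 1, C consumes 2.
firings = repetition_vector(
    ["A", "B", "C"],
    [("A", 2, "B", 3), ("B", 1, "C", 2)],
)
print(firings)
```

With these counts each channel returns to its starting fill level after one full iteration, which is why buffer sizes can be fixed ahead of time.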
Everything is a service / actor…
Smart Discovery
Find a component (here: an actor) in different locations (“categories”)
• … based on the semantic annotation of the component (or its ports)
Browse for Components Search for Component Name Search for Category / Keyword
Behold the Beauty of Scientific Workflow Design
Author: Kristian Stevens, UC Davis
… Shimology Part 2: the ugly truth inside
Author: Kristian Stevens, UC Davis
Triana
Slides courtesy of Ian Taylor
Triana Focus
• Two core underlying focuses:
– Interactive graphical programming of the distributed tasks - complex editing
  • Intuitive drag/drop flexible editing - copy/paste services, wizards for creating tools/toolboxes, user interfaces, adding nodes and multi-level grouping.
  • Has been used as a “graphical editor” for other languages, e.g. DAG, VDLx (DAX in progress).
– Heterogeneous workflows - Bridge the gap between different distributed environments
  • Use cross-environment interfaces
  • Led to integration with GAT (pre SAGA), GAP
Types of Uses
– For fine-grained operations, specifying dataflow for local operations
– Or coarse-grained composition of a distributed workflow
– Or both - can connect heterogeneous tools (e.g. Web services, Java units, Jxta services) on one workflow
Has been used as a dataflow system, a distributed-workflow environment, a workflow-management system, an automated scripting tool, and a workflow editor.
Current Capabilities
• Local Java Units
– 600 units in signal, image, audio, text processing, complete math/stats toolboxes etc
– Common units - flexible importers/exporters, graphing, duplicators
– Data types - strong data types for a number of domains - includes run-time checking
• Distributed Integration
– GAT - Java GAT implementation - graphical representation of GAT primitives - supports GRAM, GridFTP, etc
– GAP - SOA publish, find, bind triad of operations
  • Bindings: Jxta, P2PS, Web Services, WS-RF
– Group unit deployment
• Legacy Applications
– Can incorporate legacy applications easily (using local GAT adaptor) - standard file in/out interface
Distributed Work-flow
[Figure: elements shown: Workflow Commands; Workflow, e.g. BPEL4WS; Triana Engine; Triana Service & Engine; Remote Legacy Applications; Distributed services. Captions: Distributing Triana Units or Groups (Java); Integrating Legacy applications into Workflow; Integrating Web Services or P2P Services. Interfaces: GAP, GAT & GAP, GAP. Layers: Upperware, Middleware.]
Triana, the GAT and the GAP
[Figure: Triana, “A Graphical Grid Computing Environment or Portal”, on top of two interfaces.
GAP Interface (Service Based Computing: deployment, discovery and communication with distributed services, e.g. P2P and (GSI) Web services): P2PS (P2PS Discovery, P2PS Pipes), JXTA (JXTA Discovery, JXTA Pipes), Web Services (UDDI, SOAP).
GAT Interface (Grid Computing: job submission, file services): Condor, Globus RLS, Unicore, PBS, GridLab GRMS, SGE, SSH, WSRF, LDR, .NET, GridFTP, other.]
Audio Processing (Groups)
Group Units
GAT Interface
• Main deliverable of GridLab
• Application-level interface
• With a set of adapters
– That adapt the interface to an underlying capability
• Versions in C++ and Java
• Precursor to SAGA - Simple API for Grid Applications
GAT Adapters: Example
[Figure: the call Copy File(Machine A, Machine B) goes through the GAT API (Resource Management, Streaming/Comms, File Management, Job Management, Monitoring, Collection Management) to the GAT Engine. Adapters shown: Grid FTP Adapter (Grid FTP Connection) for the Grid Environment; Jxta File Adapter (Jxta Pipe) for the P2P Environment.]
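The adapter idea in this figure, one application-level call served by whichever adapter fits the environment, can be sketched as follows. This is not the GAT API: the class and method names are invented for illustration, the adapters return a description instead of moving data, and the `jxta://` URL scheme is a made-up convention for this example.

```python
from abc import ABC, abstractmethod


class FileAdapter(ABC):
    """Hypothetical adapter interface: maps one generic call onto a backend."""

    @abstractmethod
    def can_handle(self, source, destination):
        ...

    @abstractmethod
    def copy(self, source, destination):
        ...


class GridFtpAdapter(FileAdapter):
    def can_handle(self, source, destination):
        return source.startswith("gsiftp://") and destination.startswith("gsiftp://")

    def copy(self, source, destination):
        # A real adapter would open a GridFTP connection here.
        return f"GridFTP transfer {source} -> {destination}"


class JxtaFileAdapter(FileAdapter):
    def can_handle(self, source, destination):
        return source.startswith("jxta://") and destination.startswith("jxta://")

    def copy(self, source, destination):
        # A real adapter would stream the file over a JXTA pipe here.
        return f"JXTA pipe transfer {source} -> {destination}"


class Engine:
    """Picks the first adapter that can serve the request."""

    def __init__(self, adapters):
        self.adapters = list(adapters)

    def copy_file(self, source, destination):
        for adapter in self.adapters:
            if adapter.can_handle(source, destination):
                return adapter.copy(source, destination)
        raise LookupError(f"no adapter for {source} -> {destination}")


engine = Engine([GridFtpAdapter(), JxtaFileAdapter()])
grid_result = engine.copy_file("gsiftp://machineA/data.h5", "gsiftp://machineB/data.h5")
p2p_result = engine.copy_file("jxta://machineA/data.h5", "jxta://machineB/data.h5")
print(grid_result)
print(p2p_result)
```

The application code calls `copy_file` the same way in both environments; only the adapter list changes.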
GAP Interface
• Motivated by GAT
• A simple service-based API for
– Service Deployment
– Service Discovery
– Pipe Based Communication
• Static application interface with multiple middleware bindings
– P2PS (name…?)
– JXTA
– Web services
[Figure: GAP Interface over P2PS (P2PS Discovery, P2PS Pipes), JXTA (JXTA Discovery, JXTA Pipes) and Web Services (UDDI, SOAP)]
Deploying and Connecting To Remote Services
• Running services are automatically discovered via the GAP Interface, and appear in the tool tree
• User can drag remote services onto the workspace and connect cables to them like standard tools (except the cables represent actual JXTA/P2PS pipes)
[Figure: Remote Services]
Web Service Discovery
• Triana allows users to query UDDI repositories
• Alternatively, users can import services directly from WSDL
Complex Data Types
• Users can build their own interface for creating/mediating between complex types
• Alternatively, Triana can dynamically generate an interface from the WSDL2Java generated bean class
Askalon
Slides Courtesy of Thomas Fahringer
ASKALON: Application Development and Runtime Environment for the Grid
Goal: simple, efficient, effective application development for the Grid
• Invisible Grid
• Application Modeling (UML) and programming at a high level of abstraction (AGWL)
• Semantics technologies
• Semi-automatic deployment
• SOA-based runtime environment with stateful services
• Analysis and optimization of performance, costs and reliability
ASKALON Workflow Composition and Runtime Environment
[Figure: elements shown: UML-based Workflow Composition; AGWL (e.g. <agwl> <parallel> activity </parallel> </agwl>); Runtime Middleware Services (Scheduler, Execution Engine, Resource Manager, Data Repository, Job, Performance Analysis); WSRF; Globus toolkit; The Grid.]
Austrian Grid
• 517 CPUs distributed across 5 cities and over 20 parallel computers
[Figure: map of Austrian Grid sites with CPU counts and local resource managers (Torque, PBS, SGE, MAUI); the machines are listed in the table below.]
Parallel computer   #     CPU       Clock   Architecture   Location
altix1.jku          64    ITA2      1.6     ccNUMA         Linz
hydra.gup           16    Athlon    1.6     COW            Linz
schafberg.sbg       16    ITA2      1.6     ccNUMA         Salzburg
grid.fhv.at         21    Xeon      3       COW            Vorarlberg
gescher.vcpc        32    Xeon      3       COW            Vienna
karwendel.dps       80    Opteron   2.2     COW            Innsbruck
altix1.uibk         16    ITA2      1.6     ccNUMA         Innsbruck
hc-ma.uibk          16    Opteron   2.2     COW            Innsbruck
zid-grid            272   P4        1.8     NOW            Innsbruck
ASKALON Workflows
• Activity = basic or atomic unit of computation
• Activity type
– Functional description of the activity
  • Signature specified by data input/output ports
– Semantically meaningful name
  • E.g. matrix multiplication, Gaussian elimination, povray, png2yuv, ffmpeg, FFT, LAPW, WASIM, …
– Implementation-independent
• Workflow = collection of activity types interconnected through control flow and data flow dependencies
– Plus some advanced constructs
• Activity deployment
– Binds an activity type to a concrete installed implementation
– Description how to instantiate the activity
– Registered by the application provider in a special registry of the Resource Management service
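The split between an implementation-independent activity type and its concrete deployments can be sketched with two small records and a registry. None of this is ASKALON code: the class names, the fields, the site names and the executable paths are all assumptions chosen to mirror the bullet points above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityType:
    """Abstract description: a meaningful name plus a port signature."""
    name: str
    inputs: tuple
    outputs: tuple


@dataclass(frozen=True)
class ActivityDeployment:
    """Binds an activity type to one installed implementation."""
    activity_type: str
    site: str
    executable: str


class DeploymentRegistry:
    """Hypothetical registry where application providers record deployments."""

    def __init__(self):
        self._by_type = {}

    def register(self, deployment):
        self._by_type.setdefault(deployment.activity_type, []).append(deployment)

    def lookup(self, activity_type):
        # Return a copy so callers cannot mutate the registry's list.
        return list(self._by_type.get(activity_type, []))


fft = ActivityType(name="FFT", inputs=("signal",), outputs=("spectrum",))

registry = DeploymentRegistry()
# Site names and paths below are invented for the example.
registry.register(ActivityDeployment("FFT", "site-a", "/opt/fft/bin/fft"))
registry.register(ActivityDeployment("FFT", "site-b", "/usr/local/bin/fft"))

candidates = registry.lookup(fft.name)
print([d.site for d in candidates])
```

A workflow refers only to `fft`; which deployment actually runs is decided later, when a concrete resource is selected.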
ASKALON: Abstract Grid Workflow Language (AGWL)
• Atomic activities
– abstract from the real implementation, e.g. Web services, legacy applications
– Sequential constructs: <sequence>
– Conditional constructs: <if>, <switch>
• Basic compound activities
– Loop constructs: <while>, <dowhile>, <for>, <forEach>
– Directed Acyclic Graph constructs: <dag>
• Advanced compound activities
– Parallel section constructs: <parallel>
– Parallel loop constructs: <parallelFor>, <parallelForEach>
• Data flow constructs
– dataIn/dataOut ports, collections, data repositories, data set distributions, etc.
• Properties
– provide hints about the behavior of activities
– Predicted I/O data size, computational complexity, non-functional parameters
• Constraints
– Optimization metric (e.g. performance, cost, fault tolerance)
– Scheduling constraints (e.g. compute architecture, disk, memory)
ASKALON Workflow Development Stack
[Figure: layers: Portal, AGWL (Abstract Grid Workflow Language), CGWR (Concrete Grid Workflow Representation), Grid. Artifacts: UML: Workflow UML model; XML: Activity Type; Java: Activity Type; ASKALON: Activity Deployment; Grid: Activity Instance. Other labels: Application Developer, ASKALON Middleware, Concretizing.]
Real-world Scientific Workflows with ASKALON
• WIEN2k
• Material science application
• Technical University of Vienna
– Institute of Theoretical Chemistry
• Seven activity types
• Over 500 activity instances
• Statically unknown number of sequential loop iterations
[Figure: WIEN2k workflow: StageIn, LAPW0, LAPW1_K1 … LAPW1_Kn, LAPW2_FERMI, LAPW2_K1 … LAPW2_Kn, Sumpara, Lcore, Mixer, Converged?, StageOut]
Resource Management
• Resource brokerage
– Interface to MDS information service for resource discovery
– Selection based on matchmaking
• Advance reservation
– Useful for co-allocation purposes
• GLARE
– Registry of activity deployments
• Activity deployment
– Binds an abstract activity type to a concrete implementation
– Refers to an installed executable or a deployed Web/Grid service
– Description how to instantiate the activity
– Registered in GLARE by the application provider
Askalon Runtime Environment
Dynamic Bindings of Workflow Abstract - Concrete
[Figure: an Abstract Workflow of activity types (abstract) and a Concrete Workflow of activity deployments (Web Services, Executables) on Node 1, Node 2, Node 3 and Node 4, linked through the Resource Manager]
Composite Activities
• Composite activity
– Sequence
– Parallel activities
– Conditional activities: if, switch
– Sequential loops: for, while, for each
– Parallel loops: parallel for, parallel for each
– Sub-workflows
<sequence name="seq">
  <dataIn name="in" source=... />
  <activity name="A1">
    <dataIn name="in" source="seq/in" />
    ...
    <dataOut name="out" />
  </activity>
  <activity name="A2">
    <dataIn name="in" source="A1/out" />
    ...
    <dataOut name="out" />
  </activity>
  <dataOut name="out" source="A2/out" />
</sequence>
[Figure: Sequence of A1 and A2, with data flow and control flow arrows]
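To make the data-flow wiring of the sequence example explicit, the sketch below parses a filled-in copy of the snippet and lists each (source, destination port) pair. The XML here is a simplified stand-in, not guaranteed to be valid AGWL: the elided parts are dropped, and the top-level source `user/input` is a placeholder invented for this example. The helper name `data_flow_edges` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the slide's snippet; "user/input" is a placeholder.
AGWL = """
<sequence name="seq">
  <dataIn name="in" source="user/input" />
  <activity name="A1">
    <dataIn name="in" source="seq/in" />
    <dataOut name="out" />
  </activity>
  <activity name="A2">
    <dataIn name="in" source="A1/out" />
    <dataOut name="out" />
  </activity>
  <dataOut name="out" source="A2/out" />
</sequence>
"""


def data_flow_edges(xml_text):
    """Return (source, 'owner/port') pairs for every port that names a source."""
    root = ET.fromstring(xml_text)
    edges = []
    for element in root.iter():
        if element.tag not in ("sequence", "activity"):
            continue
        owner = element.get("name")
        for port in element:  # direct children only
            if port.tag in ("dataIn", "dataOut") and port.get("source"):
                edges.append((port.get("source"), f"{owner}/{port.get('name')}"))
    return edges


edges = data_flow_edges(AGWL)
for source, destination in edges:
    print(source, "->", destination)
```

Following the pairs from `seq/in` through `A1` and `A2` to `seq/out` reproduces the chain drawn on the slide.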
If-then-else
<if>
  <dataIn ...>
  <condition> ... </condition>
  <then>
    <activity name="A2">
      <dataIn name="in" source="..." />
      ...
      <dataOut name="out" />
    </activity>
  </then>
  <else>
    <activity name="A3">
      <dataIn name="in" source="..." />
      ...
      <dataOut name="out" />
    </activity>
  </else>
  <dataOut name="ifout" source="A2/out,A3/out" />
</if>
[Figure: if-then-else with activities A0, A1, A2 (then branch) and A3 (else branch); arrows numbered (1) to (4)]
Execution Engine
• Workflow controller
– Converts XML-based specification (AGWL) to internal representation
– Executes the workflow according to control and data flow dependencies
• One separate Controller for every workflow instance
• Event system
– Other components can subscribe to the internal events
– e.g. logging, controller, tool (WS-Notification), ...
• Logging and database
– For post-mortem performance analysis
• GT4 WSRF wrapper
– Send WS-Notifications to the portal
Scheduler
– Receives jobs ready to execute from the task loop
– Retrieves the available resources from GridARM
– Assigns the task to the best machine according to the selection criteria
  o Clock speed × number of free processors
  o Prediction information, memory available, …
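The first selection criterion, clock speed multiplied by the number of free processors, can be sketched as a simple ranking function. This is an illustration only: the `Machine` record, the machine names and all the numbers are made up for the example, and the real scheduler also weighs prediction information and available memory, which are ignored here.

```python
from dataclasses import dataclass


@dataclass
class Machine:
    name: str
    clock: float           # clock speed; units assumed consistent across machines
    free_processors: int


def score(machine):
    """Selection criterion from the slide: clock speed x free processors."""
    return machine.clock * machine.free_processors


def schedule(tasks, machines):
    """Assign each ready task to the currently best-scoring machine."""
    assignment = {}
    for task in tasks:
        candidates = [m for m in machines if m.free_processors > 0]
        if not candidates:
            raise RuntimeError("no free processors left")
        best = max(candidates, key=score)
        assignment[task] = best.name
        best.free_processors -= 1   # the task now occupies one processor
    return assignment


# Illustrative values only.
machines = [
    Machine("machine-a", clock=2.2, free_processors=2),
    Machine("machine-b", clock=1.6, free_processors=4),
    Machine("machine-c", clock=1.8, free_processors=3),
]
assignment = schedule(["t1", "t2", "t3", "t4"], machines)
print(assignment)
```

Because the score is recomputed after every assignment, load spreads across machines instead of piling onto the one that was best at the start.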
[Figure: Execution Engine components: Core (Task Loop, Fault Handler, Controller, AGWL Interpreter), Event System, GT4 WSRF Service, Logging & Database, Scheduler, Execution / Launching Framework, GridARM; input: AGWL]