Ilkay ALTINTAS – March 11, 2009
Ilkay ALTINTAS
Deputy Coordinator for Research, San Diego Supercomputer Center
Lab Director, Scientific Workflow Automation Technologies
Accelerating the Scientific Exploration Process with Scientific Workflows
-- with a focus on the Kepler System --
"Why does this magnificent applied science, which saves work and makes life easier, bring us so little happiness?
The simple answer runs: Because we have not yet learned to make sensible use of it."
– Albert Einstein, in an address at Cal Tech, 1931. (Harper)
Today’s scientific method
Observe → Hypothesize → Predict → Conduct experiment → Analyze data → Compare results and conclude
More to add to this picture: network, Grid, portals, and more.
• Observing / Data: Microscopes, telescopes, particle accelerators, X-rays, MRIs, microarrays, satellite-based sensors, sensor networks, field studies…
• Analysis, Prediction / Models and model execution: Potentially large computation and visualization
Scientific Method: Hypothesis vs. Data
Scientific workflows emerged as an answer to the need to combine multiple cyberinfrastructure components in automated process networks.

So, what is a scientific workflow?
The Big Picture: Supporting the Scientist
From "Napkin Drawings" to Executable Workflows: a conceptual SWF becomes an executable SWF.
Example pipeline: Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS
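The pipeline above can be sketched as plain function composition, the way Kepler chains actors. This is a hedged illustration: the function bodies are invented stand-ins for the real tools (Circonspect, PHACCS), whose interfaces are not shown in the slides.

```python
# Hedged sketch of the alpha-diversity "napkin drawing" as chained functions.
# All bodies are illustrative placeholders, not the real tools.

def run_contig_spectrum(records):
    # Stand-in for Circonspect: summarize the read set.
    return {"reads": len(records)}

def average_genome_size(records):
    # Stand-in: mean sequence length instead of the real estimator.
    sizes = [len(seq) for _, seq in records]
    return sum(sizes) / len(sizes)

def combine_results(spectrum, avg_size):
    # Merge the two upstream results into one record.
    return {**spectrum, "avg_genome_size": avg_size}

def run_phaccs(combined):
    # Stand-in for PHACCS: would fit diversity models to the inputs.
    return {"input": combined, "model": "stub"}

def alpha_diversity_pipeline(records):
    """records: (header, sequence) pairs parsed from a FASTA file."""
    spectrum = run_contig_spectrum(records)
    avg = average_genome_size(records)
    return run_phaccs(combine_results(spectrum, avg))
```

In Kepler each step would be an actor and the function-call edges would be port connections; the composition itself is what the workflow graph expresses.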
Phylogeny Analysis Workflows
Local Disk → Multiple Sequence Alignment → Phylogeny Analysis → Tree Visualization
Promoter Identification Workflow
Source: Matt Coleman (LLNL)
Point-in-Polygon algorithm
CPES Fusion Simulation Workflow
• Fusion Simulation Codes: (a) GTC; (b) XGC with M3D
  – e.g. (a) currently 4,800 (soon: 9,600) nodes Cray XT3; 9.6 TB RAM; 1.5 TB simulation data/run
• GOAL:
  – automate remote simulation job submission
  – continuous file movement to secondary analysis cluster for dynamic visualization & simulation control
  – … with runtime-configurable observables
Workflow steps: Select Job Manager → Submit Simulation Job → Submit File Mover Job
Overall architect (& prototypical user): Scott Klasky (ORNL)
WF design & implementation: Norbert Podhorszki (UC Davis)
CPES Analysis Workflow
• Concurrent analysis pipeline (@ Analysis Cluster):
  – convert; analyze; copy-to-Web-portal
  – easy configuration, re-purposing
• Screenshot highlights: pipelined execution model; a reusable actor "class" specialized into actor "instances"; inline documentation; easy-to-edit parameter settings; inline display.
Overall architect (& prototypical user): Scott Klasky (ORNL)
WF design & implementation: Norbert Podhorszki (UC Davis)
Scientific Workflow Systems
• Combination of
  – data integration, analysis, and visualization steps
  – automated "scientific process"
• Mission of scientific workflow systems
  – Promote "scientific discovery" by providing tools and methods to generate scientific workflows
  – Create an extensible and customizable graphical user interface for scientists from different scientific domains
  – Support computational experiment creation, execution, sharing, reuse and provenance
  – Design frameworks which define efficient ways to connect to existing data and integrate heterogeneous data from multiple resources
• Make technology useful through the user's monitor!
Some Popular Scientific Workflow Systems are…
• Ptolemy II: a laboratory for investigating design
• KEPLER: a problem-solving environment for scientific workflows
  – KEPLER = "Ptolemy II + X" for scientific workflows
  – Builds upon the open-source Ptolemy II framework
  – … and a cross-project collaboration, initiated August 2003
  – 1st release: May 13th, 2008; more than 20 thousand downloads!
  – www.kepler-project.org
• Different scientific workflow systems:
  – Visual component integration: Taverna, Triana
  – Grid-based distributed execution: Pegasus, Askalon
  – Visualization: VisTrails, SCIRun
  – Transaction-oriented: BPEL (mostly industrial)
• Execution platforms: portals (e.g., GEON), Web 2.0 (e.g., myExperiment)
Kepler is a Team Effort
– Some CI projects using Kepler:
  • SEEK (ecology)
  • SciDAC (astrophysics, bio, ...)
  • CPES (plasma simulation, combustion)
  • GEON (geosciences)
  • CiPRes (phylogenetics)
  • ROADnet (real-time data)
  • Processing Phylodata (pPOD)
  • REAP (streaming data)
  • Digital preservation (DIGARCH)
  • COMET (environmental science)
  • ITER (fusion)
  • OOI CI - ORION (ocean observing CI)
  • LOOKING (oceanography)
  • CAMERA (metagenomics)
  • Resurgence (computational chemistry)
  • ChIP-chip (genomics)
  • Cheshire Digital Library (archival)
  • Cell Biology (Scripps)
  • DART (X-ray crystallography)
  • Ocean Life
  • Assembling the Tree of Life project
  • NEES (earthquake engineering)
  • ...
Kepler Software Development Practice
• How does this all work?
  – Joint CVS -- special rules!
    • Projects like SDM, CiPRes, and Resurgence have their specialized releases out of a common infrastructure!
  – Open-source (BSD)
  – Website: Wiki -- http://kepler-project.org
  – Communications:
    • Busy IRC channel
    • Mailing lists: kepler-dev, kepler-users, kepler-members
    • Telecons for design discussions
  – 6-monthly hackathons
  – Focus group meetings: workshops and conference calls
• How will it all persist?
How will it all persist?
• Development of Kepler C.O.R.E. -- A Comprehensive, Open, Robust, and Extensible Scientific Workflow Infrastructure
  • Ludäscher, Altintas, Bowers, Jones, McPhillips, Schildhauer
• Extensibility + Governance + Sustainability
• Goals:
  • Reliable: refactored build, more modular design, improved engineering practices
  • Independently extensible
  • Open architecture, open project: with improved governance!
First Kepler Stakeholders Meeting
• Organized by Kepler/CORE
• May 13-15, 2008
• 35+ stakeholders from 5 countries and 25 projects
• Big move towards web execution environments through virtual laboratories
• Different software projects building upon Kepler: Hydrant (Australia), KFlex (Germany), Nimrod/K (Australia)
• Short-term goals, in addition to a growing set of tools for data processing components -- enable extension points for:
  • A customizable workflow authoring user interface (Application + Web)
  • Moving beyond the desktop environment for the full scientific process
  • Provenance tracking for workflow design and execution
  • Extensible access to multiple data repositories
  • Distributed execution of workflows and computational experiments
  • Social networks to share and build workflows
So, what is in Kepler?
Actors are the Processing Components
Actor-Oriented Design
• Actor
  – Encapsulation of parameterized actions
  – Interface defined by ports and parameters
• Port
  – Communication of input and output data
  – Without call-return semantics
• Model of computation
  – Communication semantics among ports
  – Flow of control
  – Implementation is a framework
• Examples
  – Simulink (The MathWorks)
  – LabVIEW (from National Instruments)
  – Easy 5x (from Boeing)
  – ROOM (real-time object-oriented modeling)
  – ADL (Wright)
  – …
Source: Edward A. Lee, UC Berkeley
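The actor/port vocabulary above can be made concrete with a minimal sketch. The class names (`Port`, `Actor`, `Scale`) and their shapes are invented for illustration; they are not Kepler's or Ptolemy II's actual Java classes.

```python
from collections import deque

class Port:
    """A token channel between actors; no call-return semantics."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()
    def send(self, token):
        self.queue.append(token)
    def receive(self):
        return self.queue.popleft()
    def has_token(self):
        return len(self.queue) > 0

class Actor:
    """A parameterized action whose interface is its ports and parameters."""
    def __init__(self, name, **parameters):
        self.name = name
        self.parameters = parameters
        self.inputs = {}
        self.outputs = {}
    def fire(self):
        raise NotImplementedError

class Scale(Actor):
    """Example actor: multiplies each input token by a 'factor' parameter."""
    def __init__(self, name, factor=1):
        super().__init__(name, factor=factor)
        self.inputs["in"] = Port("in")
        self.outputs["out"] = Port("out")
    def fire(self):
        token = self.inputs["in"].receive()
        self.outputs["out"].send(token * self.parameters["factor"])
```

Note that `Scale.fire` never calls another actor: it only consumes and produces tokens, which is exactly what lets a separate director decide the flow of control.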
Some actors in place for…
• Generic Web Service Client and Web Service Harvester
• Customizable RDBMS query and update
• Command-line wrapper tools (local, ssh, scp, ftp, etc.)
• Some Grid actors: Globus Job Runner, GridFTP-based file access, Proxy Certificate Generator
• SRB support
• Native R and Matlab support
• Interaction with Nimrod and APST
• Communication with ORBs through actors and services
• Imaging, gridding, vis support
• Textual and graphical output
• …more generic and domain-oriented actors…
Directors are the WF Engines that…
• Implement different computational models
• Define the semantics of
  – execution of actors and workflows
  – interactions between actors
Ptolemy and Kepler are unique in combining different execution models in heterogeneous models!
• Kepler is extending Ptolemy directors with specialized ones for web-service-based workflows and distributed workflows.
• Models of computation include: Process Networks, Rendezvous, Publish and Subscribe, Continuous Time, Finite State Machines, Dataflow, Time Triggered, Synchronous/Reactive, Discrete Event, Wireless
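To show what "a director defines the semantics of execution" means, here is a toy scheduler that approximates a synchronous-dataflow model: it fires actors in a static order and moves tokens along named channels. Everything here (class name, channel scheme) is invented for illustration; Kepler's real directors come from the Ptolemy II framework.

```python
# Toy "director" sketch: actors are plain callables; the director alone
# decides when each fires and how tokens move between them.

class ToySDFDirector:
    def __init__(self):
        self.actors = []       # list of (callable, input_channels, output_channels)
        self.channels = {}     # channel name -> queued tokens

    def add_actor(self, func, ins, outs):
        self.actors.append((func, ins, outs))
        for c in ins + outs:
            self.channels.setdefault(c, [])

    def run(self, iterations=1):
        # Static firing order per iteration, as in synchronous dataflow.
        for _ in range(iterations):
            for func, ins, outs in self.actors:
                args = [self.channels[c].pop(0) for c in ins]
                results = func(*args)
                for c, token in zip(outs, results):
                    self.channels[c].append(token)

d = ToySDFDirector()
d.add_actor(lambda: (1,), [], ["a"])            # source actor
d.add_actor(lambda x: (x * 10,), ["a"], ["b"])  # transform actor
d.run(iterations=3)                             # channel "b" now holds [10, 10, 10]
```

Swapping in a different `run` method (e.g., one that fires actors in separate threads blocking on empty channels) would give a Process Networks semantics over the same actor graph, which is the point of separating directors from actors.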
Vergil is the GUI for Kepler
• Actor ontology and semantic search for actors• Search -> Drag and drop -> Link via ports• Metadata-based search for datasets
[Screenshots: Actor Search and Data Search panels]
Actor Search
• Kepler Actor Ontology
• Used in searching actors and creating conceptual views (= folders)
Currently more than 200 Kepler actors added!
Data Search and Usage of Results
• EarthGrid
  – Discovery of data resources through local and remote services: SRB, Grid and Web services, DB connections
  – Registry of datasets on the fly using workflows
Provenance of Workflow Related Data
• Provenance: a concept from art history and library science
  – Inputs, outputs, intermediate results, workflow design, workflow run
• Collected information
  – Can be used in a number of ways: validation, reproducibility, fault tolerance, etc.
  – Linked to the data resources
  – Viewable and searchable from outside Kepler
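A minimal sketch of the idea, assuming an invented record layout (the real Kepler provenance schema is shown on a later slide and is not reproduced here): capture workflow design, inputs, and outputs per run so results can later be validated or searched.

```python
import datetime
import hashlib

class ProvenanceRecorder:
    """Illustrative recorder: stores one record per workflow run.
    The record fields are invented for this sketch."""

    def __init__(self):
        self.runs = []

    def record(self, workflow_name, design, inputs, step):
        # Run the step, then log design hash, timestamp, inputs, and outputs.
        outputs = step(inputs)
        self.runs.append({
            "workflow": workflow_name,
            "design_hash": hashlib.sha256(design.encode()).hexdigest(),
            "timestamp": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "inputs": inputs,
            "outputs": outputs,
        })
        return outputs

    def search(self, workflow_name):
        # "Searchable from outside Kepler": query past runs by workflow name.
        return [r for r in self.runs if r["workflow"] == workflow_name]
```

Hashing the design lets a later run detect whether results came from the same workflow version, which is the basis for comparison between workflow versions and smart re-runs.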
Running Provenance Recorder
Provenance Schema
Current Advances and Users
• Kepler Users:
  – User interface users: workflow developers, scientists, software developers, engineers, researchers
  – Batch users: portals, other workflow systems using Kepler as an engine
• Current advances:
  • Collection-oriented workflows
  • Domain-specific actors
  • Provenance framework
  • Semantics support: annotation, search, workflow validation, integration
  • Data and actor search
  • EarthGrid data access system
  • Kepler Component Library
  • Kepler Archive (KAR) format
  • Integrated support for LSID identifiers for all objects
  • Object Manager and cache
  • Web service execution
  • RExpression & MatlabExpression
  • Redesigned user interface
  • Authentication subsystem
  • Null-value handling
  • Documentation
  • Distributed computing support: NIMROD, Globus, ssh, Master/Slave, …
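The Kepler Archive (KAR) format mentioned above is, in spirit, a bundle of workflow components plus a manifest of identifiers (Kepler uses LSIDs). The sketch below illustrates that bundling idea with a plain zip; the manifest layout is invented for this sketch and is not the real KAR specification.

```python
import io
import zipfile

# Hedged sketch: bundle named components and a manifest into one archive,
# roughly the role a KAR file plays for Kepler components.

def write_archive(entries, manifest_lines):
    """entries: {filename: text}; manifest_lines: identifier lines."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as kar:
        kar.writestr("MANIFEST.MF", "\n".join(manifest_lines))
        for name, data in entries.items():
            kar.writestr(name, data)
    return buf.getvalue()

def read_manifest(archive_bytes):
    """Recover the identifier lines from an archive produced above."""
    with zipfile.ZipFile(io.BytesIO(archive_bytes)) as kar:
        return kar.read("MANIFEST.MF").decode().splitlines()
```

Shipping components with their identifiers in one file is what lets the Object Manager and cache resolve an actor or dataset by LSID regardless of which archive it arrived in.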
Kepler System Architecture
[Architecture diagram: the Vergil GUI and Kepler GUI extensions sit on top; beneath them, Kepler Core extensions (authentication, SMS, actor & data search, type system extensions, provenance framework, Kepler object manager, documentation, smart re-run / failure recovery) wrap the Ptolemy core.]
Kepler can be used as a batch execution engine
• Configuration phase
• Subset: DB2 query on DataStar
• Interpolate: Grass RST, Grass IDW, GMT…
• Visualize: Global Mapper, FlederMaus, ArcIMS
[Diagram: Portal → Grid; Subset → Analyze (move, process) → Visualize (move, render, display); scheduling/output processing; monitoring/translation.]
Kepler can be executed as a Web Service!
• Invocation
  – SOAP-based
  – RESTful
• Executes the workflow without the graphical user interface
• Outputs
  – Files and text matching the display-related actors of the executed workflow
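A hedged sketch of what a RESTful invocation like this could look like from a client. The endpoint URL and JSON fields below are invented for illustration; the slides do not specify the real service's API.

```python
import json
import urllib.request

# Hypothetical workflow-execution endpoint -- a placeholder, not a real URL.
SERVICE_URL = "http://example.org/kepler/runs"

def build_run_request(workflow_id, parameters):
    """Build an HTTP POST asking the service to run a workflow headlessly."""
    body = json.dumps({"workflow": workflow_id, "parameters": parameters})
    return urllib.request.Request(
        SERVICE_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def parse_outputs(response_text):
    """Per the slide, outputs are the files/text the display actors produced."""
    return json.loads(response_text).get("outputs", [])
```

A caller would pass the request to `urllib.request.urlopen` and feed the response body to `parse_outputs`; the same payload shape would work over SOAP with a different transport layer.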
What have we learned?
Scientific workflow systemsas a research area…
Why workflows? What’s next?
Advantages of Scientific Workflow Systems
• Formalization of the scientific process
• Easy to share, adapt and reuse
  – Deployable, customizable, extensible
• Management of complexity and usability
  – Support for hierarchical composition
  – Interfaces to different technologies from a unified interface
  – Can be annotated with domain knowledge
• Tracking provenance of the data and processes
  – Keep the association of results to processes
  – Make it easier to validate/regenerate results and processes
  – Enable comparison between different workflow versions
• Execution monitoring and fault tolerance
• Interaction with multiple tools and resources at once
Evolving Challenges For Scientific Workflows
• Access heterogeneous data and computational resources and link to different domain knowledge
• Interface to multiple analysis tools and workflow systems
  – One size doesn't fit all!
• Support computational experiment creation, execution, sharing, reuse and provenance
  – Manage complexity, user and process interactivity
  – Extensions for adaptive and dynamic workflows
• Track provenance of workflow design (= evolution), execution, and intermediate and final results
• Efficient failure recovery and smart re-runs
• Support various file and process transport mechanisms
  – Main memory, Java shared file system, …
Evolving Challenges For Scientific Workflows
• Support the full scientific process
  – Use and control instruments, networks and observatories in observing steps
  – Scientifically and statistically analyze and control the data collected by the observing steps
  – Set up simulations as testbeds for possible observatories
• Come up with efficient and intuitive workflow deployment methods
• Do all these in a secure and easy-to-use way!
So, show me an example CI project that uses Kepler.
CI Project: REAP
• Funded 2006-2009
• NSF CEO:P
• Jones, Altintas, Baru, Ludäscher, Schildhauer
• Partners: UCSB, SDSC/UCSD, UC Davis, UCLA, OPeNDAP, OSU
• Lead institution: NCEAS/UCSB
• Management and Analysis of Observatory Data using Kepler Scientific Workflows
• The vision: an integrated environment for analyzing data from observatories
• http://reap.ecoinformatics.org/
An End-to-End CI for Observatories
• Overall goal: to bring together, for the first time, seamless access to sensor data from real-time data grids with the analytical tools and sophisticated modeling capabilities of scientific workflow environments
• Scientist's view:
  1. Access remote real-time and archived data streams -- as if they were locally generated!
  2. Design and execute scientific workflows
     • Processing steps: scientific models
     • Data: raw or derived from sensor networks and data archives
  3. Combine data streams in hybrid analytical models
• System engineer's view:
  1. View and monitor observatory infrastructure components
  2. Model the impacts of system changes before they are executed
  3. Modify the configuration of the observatory sensors and network
CI Project: CAMERA
Current Use Case Focus
• Alpha and Gamma Diversity Workflows
  – With the Forrest Rohwer Lab
  – Status: final version executable through CAMERA 2.0
• UCSD Annotation Pipeline (RAMMCAP)
  – With Weizhong Li
  – Status: first version executable through CAMERA 2.0
• Phylogenomic and Phylometagenomic Workflow
  – With the Jonathan Eisen Lab
  – Status: design in progress
Alpha Diversity Workflow
Fasta File → Circonspect → Average Genome Size → Combine Results → PHACCS
User can customize parameters.
PHACCS steps in the workflow:
1. Submit job & get job id
2. Check job status
3. Fetch & display results
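The three PHACCS steps follow a generic submit/poll/fetch pattern that the workflow actors implement. The sketch below shows that pattern against a toy in-memory service; the `JobService` class and its methods are invented test doubles, not the real CAMERA API.

```python
import time

class JobService:
    """Toy stand-in service: a job finishes after a fixed number of checks."""
    def __init__(self, checks_until_done=2):
        self.jobs = {}
        self.checks_until_done = checks_until_done

    def submit(self, params):
        job_id = f"job-{len(self.jobs) + 1}"
        self.jobs[job_id] = {"params": params, "checks": 0}
        return job_id

    def status(self, job_id):
        self.jobs[job_id]["checks"] += 1
        done = self.jobs[job_id]["checks"] >= self.checks_until_done
        return "DONE" if done else "RUNNING"

    def results(self, job_id):
        return {"job": job_id, "diversity": "stub result"}

def run_phaccs_job(service, params, poll_seconds=0):
    job_id = service.submit(params)           # 1. submit job & get job id
    while service.status(job_id) != "DONE":   # 2. check job status
        time.sleep(poll_seconds)              #    (poll until finished)
    return service.results(job_id)            # 3. fetch results
```

In the actual workflow each of the three steps is a separate actor, which is what lets the status check re-fire on a timer without blocking the rest of the graph.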
CAMERA uses results from a 2008 MURPA project!
• Transfers images and displays them on the OptIPortal using Kepler workflows -- by Hoang Nguyen and David Abramson
  – Socket connection to the OptIPortal to transfer CGLX server files
• Already used in the CAMERA project!
To Sum Up
Kepler…
• is an open-source system and collaboration
• was initiated in August 2003
• grows by application pull from contributors
• released 1.0.0 on May 13th, 2008
• There is a lot more to cover and work on…
• More information: http://kepler-project.org
Ilkay Altintas
[email protected]
+1 (858) 822-5453
http://www.sdsc.edu

Thanks! & Questions…
SWF Systems Requirements
• Design tools -- especially for non-expert users
• Ease of use -- a fairly simple user interface with more complex features hidden in the background
• Reusable generic features
  – Generic enough to serve different communities but specific enough to serve one domain (e.g. geosciences) => customizable
• Extensibility for the expert user
• Registration, publication & provenance of data products and "process products" (= workflows)
• Dynamic plug-in of data and processes from registries/repositories
• Distributed WF execution (e.g. Web and Grid awareness)
• Semantics awareness
• WF deployment
  – as a web site, as a web service, "power apps" (à la SCIRun II)
• Interoperability with other SWF systems