Accelerating the Scientific Exploration Process with Scientific … · 2016-10-24 · • Generic...


Ilkay ALTINTAS – March 11, 2009

Ilkay ALTINTAS
Deputy Coordinator for Research, San Diego Supercomputer Center
Lab Director, Scientific Workflow Automation Technologies

Accelerating the Scientific Exploration Process with Scientific Workflows

-- with a focus on the Kepler System --


"Why does this magnificent applied science, which saves work and makes life easier, bring us so little happiness?

The simple answer runs: Because we have not yet learned to make sensible use of it."

– Albert Einstein, in an address at Cal Tech, 1931. (Harper)


Today’s scientific method

Observe; Hypothesize; Conduct experiment; Analyze data; Compare results and Conclude; Predict

More to add to this picture: network, Grid, portals, +++

• Observing / Data: Microscopes, telescopes, particle accelerators, X-rays, MRIs, microarrays, satellite-based sensors, sensor networks, field studies…

• Analysis, Prediction / Models and model execution: Potentially large computation and visualization

Scientific Method: Hypothesis vs. Data


Scientific workflows emerged as an answer to the need to combine multiple Cyberinfrastructure components in automated process networks.

So, what is a scientific workflow?


The Big Picture: Supporting the Scientist

Conceptual SWF

Executable SWF

From “Napkin Drawings” to Executable Workflows

Fasta File

Circonspect

Average Genome Size

Combine Results PHACCS


Phylogeny Analysis Workflows

Local Disk

Multiple Sequence Alignment

Phylogeny Analysis

Tree Visualization


Promoter Identification Workflow

Source: Matt Coleman (LLNL)


PointInPolygon algorithm
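The slide names the algorithm but does not spell it out. As an illustration only, here is a minimal Python sketch of the classic ray-casting (even-odd) point-in-polygon test. The function name and the sample square are assumptions made for this example, and the slide does not say which variant the workflow actually uses.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def point_in_polygon(point: Point, polygon: List[Point]) -> bool:
    """Even-odd rule: count how many polygon edges a ray cast in +x crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x-coordinate where the edge meets that horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


# Hypothetical sample data: an axis-aligned 4 x 4 square.
square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_polygon((2.0, 2.0), square))
print(point_in_polygon((5.0, 2.0), square))
```

Points lying exactly on an edge are not treated specially in this sketch; a production version would need an explicit boundary rule.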


CPES Fusion Simulation Workflow

• Fusion Simulation Codes: (a) GTC; (b) XGC with M3D
– e.g. (a) currently 4,800 (soon: 9,600) nodes Cray XT3; 9.6TB RAM; 1.5TB simulation data/run
• GOAL:
– automate remote simulation job submission
– continuous file movement to secondary analysis cluster for dynamic visualization & simulation control
– … with runtime-configurable observables

SelectJobMgr

Submit Simulation Job

Submit FileMover Job

Overall architect (& prototypical user): Scott Klasky (ORNL)
WF design & implementation: Norbert Podhorszki (UC Davis)


CPES Analysis Workflow

• Concurrent analysis pipeline (@Analysis Cluster):
– convert ; analyze ; copy-to-Web-portal
– easy configuration, re-purposing

Pipelined Execution Model

Reusable Actor “Class”

Specialized Actor “Instances”

Inline Documentation

Easy-to-edit Parameter Settings

Inline Display

Overall architect (& prototypical user): Scott Klasky (ORNL)
WF design & implementation: Norbert Podhorszki (UC Davis)
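To make the convert ; analyze ; copy-to-Web-portal idea concrete, the following is a minimal Python sketch of a pipelined chain built from generators. All stage bodies, file names, and extensions are hypothetical stand-ins, not the CPES actors. Generators only give item-at-a-time streaming in a single process, so this illustrates the data flow rather than real concurrency on an analysis cluster.

```python
from typing import Iterable, Iterator, List


def convert(files: Iterable[str]) -> Iterator[str]:
    """Stage 1: stand-in for converting each simulation output file."""
    for name in files:
        yield name.replace(".raw", ".h5")


def analyze(files: Iterable[str]) -> Iterator[str]:
    """Stage 2: stand-in for analysis; emits one plot name per converted file."""
    for name in files:
        yield name.replace(".h5", ".png")


def copy_to_portal(files: Iterable[str], portal: List[str]) -> Iterator[str]:
    """Stage 3: stand-in for publishing; appends to an in-memory 'portal'."""
    for name in files:
        portal.append(name)
        yield name


portal: List[str] = []
incoming = ["step001.raw", "step002.raw", "step003.raw"]

# Each file passes through all three stages before the next file is read.
for published in copy_to_portal(analyze(convert(incoming)), portal):
    print("published", published)
```

Re-purposing the pipeline then amounts to swapping one stage function for another with the same input and output shape.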


Scientific Workflow Systems

• Combination of
– data integration, analysis, and visualization steps
– automated "scientific process"
• Mission of scientific workflow systems
– Promote “scientific discovery” by providing tools and methods to generate scientific workflows
– Create an extensible and customizable graphical user interface for scientists from different scientific domains
– Support computational experiment creation, execution, sharing, reuse and provenance
– Design frameworks which define efficient ways to connect to the existing data and integrate heterogeneous data from multiple resources
• Make technology useful through user’s monitor!!!


Ptolemy II: A laboratory for investigating design
KEPLER: A problem-solving environment for Scientific Workflow

KEPLER = “Ptolemy II + X” for Scientific Workflows

Some Popular Scientific Workflow Systems are…

• … and a cross-project collaboration… initiated August 2003
• 1st release: May 13th, 2008
• More than 20 thousand downloads!

www.kepler-project.org

• Builds upon the open-source Ptolemy II framework

Different Scientific Workflows
• Visual component integration
  • Taverna, Triana
• Grid-based distributed execution
  • Pegasus, Askalon
• Visualization
  • Vistrails, SciRUN
• Transaction-oriented
  • BPEL, mostly industrial

Execution Platforms
• Portals, e.g., GEON
• Web 2.0, e.g., myExperiment


Kepler is a Team Effort

– Some CI projects using Kepler:

• SEEK (ecology)
• SciDAC (astrophysics, bio, ...)
• CPES (plasma simulation, combustion)
• GEON (geosciences)
• CiPRes (phylogenetics)
• ROADnet (real-time data)
• Processing Phylodata (pPOD)
• REAP (streaming data)
• Digital preservation (DIGARCH)
• COMET (environmental science)
• ITER (fusion)
• OOI CI - ORION (ocean observing CI)
• LOOKING (oceanography)
• CAMERA (metagenomics)
• Resurgence (computational chemistry)
• ChIP-chip (genomics)
• Cheshire Digital Library (archival)
• Cell Biology (Scripps)
• DART (X-Ray crystallography)
• Ocean Life
• Assembling the Tree of Life project
• NEES (earthquake engineering)
• ...


Kepler Software Development Practice

• How does this all work?
– Joint CVS -- special rules!
  • Projects like SDM, Cipres, Resurgence have their specialized releases out of a common infrastructure!
– Open-source (BSD)
– Website: Wiki -- http://kepler-project.org
– Communications:
  • Busy IRC channel
  • Mailing lists: Kepler-dev, Kepler-users, Kepler-members
  • Telecons for design discussions
– 6-monthly hackathons
– Focus group meetings: workshops and conference calls
• How will it all persist?


How will it all persist?

• Development of Kepler C.O.R.E. -- A Comprehensive, Open, Robust, and Extensible Scientific Workflow Infrastructure
  • Ludäscher, Altintas, Bowers, Jones, McPhillips, Schildhauer
• Extensibility + Governance + Sustainability
• Goals:
  • Reliable
    • refactored build
    • more modular design
    • improved engineering practices
  • Independently extensible
  • Open architecture, open project: With improved governance!


First Kepler Stakeholders Meeting

• Organized by Kepler/CORE
  • May 13-15, 2008
  • 35+ stakeholders from 5 countries and 25 projects
  • Big move towards web execution environments through virtual laboratories
• Different software projects building upon Kepler
  • Hydrant (Australia), KFlex (Germany), Nimrod/K (Australia)
• Short-term goals: in addition to a growing set of tools for data processing components
  • Enable extension points for:
    • A customizable workflow authoring user interface (Application + Web)
    • Moving beyond the desktop environment for the full scientific process
    • Provenance tracking for workflow design and execution
    • Extensible access to multiple data repositories
    • Distributed execution of workflows and computational experiments
    • Social networks to share and build workflows


So, what is in Kepler?


Actors are the Processing Components

• Actor
– Encapsulation of parameterized actions
– Interface defined by ports and parameters
• Port
– Communication between input and output data
– Without call-return semantics
• Model of computation
– Communication semantics among ports
– Flow of control
– Implementation is a framework
• Examples
– Simulink (The MathWorks)
– LabVIEW (from National Instruments)
– Easy 5x (from Boeing)
– ROOM (Real-time object-oriented modeling)
– ADL (Wright)
– …

Actor-Oriented Design

Source: Edward A. Lee, UC Berkeley
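The actor, port, and parameter vocabulary can be sketched in a few lines of Python. This is a toy illustration, not the Ptolemy II or Kepler API: the `Port` and `Scale` classes and the tiny scheduling loop are all assumptions made for the example. The point it tries to show is that an actor communicates only by putting tokens on ports, with no return value handed back to a caller.

```python
from collections import deque


class Port:
    """A one-way channel: producers put tokens, consumers take them later."""

    def __init__(self):
        self._tokens = deque()

    def put(self, token):
        self._tokens.append(token)

    def has_token(self):
        return bool(self._tokens)

    def get(self):
        return self._tokens.popleft()


class Scale:
    """Hypothetical actor: one input port, one output port, one parameter."""

    def __init__(self, factor):
        self.factor = factor      # parameter
        self.input = Port()
        self.output = Port()

    def fire(self):
        # Consume one token and emit one; nothing is returned to a caller.
        self.output.put(self.input.get() * self.factor)


double = Scale(2)
triple = Scale(3)
triple.input = double.output      # link: double's output feeds triple's input

for value in [1, 2, 3]:
    double.input.put(value)

# A trivial stand-in for a scheduler: fire any actor that has input.
actors = [double, triple]
progress = True
while progress:
    progress = False
    for actor in actors:
        if actor.input.has_token():
            actor.fire()
            progress = True

results = []
while triple.output.has_token():
    results.append(triple.output.get())
print(results)  # [6, 12, 18]
```

Deciding when each actor fires is left to the loop outside the actors, which is the role the slides assign to the model of computation.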


Some actors in place for…

• Generic Web Service Client and Web Service Harvester
• Customizable RDBMS query and update
• Command Line wrapper tools (local, ssh, scp, ftp, etc.)
• Some Grid actors: Globus Job Runner, GridFTP-based file access, Proxy Certificate Generator
• SRB support
• Native R and Matlab support
• Interaction with Nimrod and APST
• Communication with ORBs through actors and services
• Imaging, Gridding, Vis Support
• Textual and Graphical Output
• …more generic and domain-oriented actors…


Directors are the WF Engines that…

• Implement different computational models
• Define the semantics of
– execution of actors and workflows
– interactions between actors

Ptolemy and Kepler are unique in combining different execution models in heterogeneous models!

• Kepler is extending Ptolemy directors with specialized ones for web service based workflows and distributed workflows.

• Process Networks
• Rendezvous
• Publish and Subscribe
• Continuous Time
• Finite State Machines
• Dataflow
• Time Triggered
• Synchronous/reactive model
• Discrete Event
• Wireless
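As a rough illustration of how a director separates execution semantics from the processing steps, the Python sketch below runs the same two stages under two toy "directors": one fires the stages in a fixed order per token, the other gives each stage its own thread and links them with FIFO queues, loosely in the spirit of process networks. Both functions and their names are assumptions for this example and are far simpler than the actual Ptolemy II directors.

```python
import queue
import threading

STOP = object()  # end-of-stream marker


def sequential_director(stages, tokens):
    """Fire the stages in a fixed order, one token at a time."""
    results = []
    for token in tokens:
        for stage in stages:
            token = stage(token)
        results.append(token)
    return results


def process_network_director(stages, tokens):
    """One thread per stage, connected by FIFO queues with blocking reads."""
    channels = [queue.Queue() for _ in range(len(stages) + 1)]

    def run(stage, inbox, outbox):
        while True:
            token = inbox.get()          # blocks until a token arrives
            if token is STOP:
                outbox.put(STOP)         # pass the marker downstream
                return
            outbox.put(stage(token))

    threads = [
        threading.Thread(target=run, args=(s, channels[i], channels[i + 1]))
        for i, s in enumerate(stages)
    ]
    for t in threads:
        t.start()
    for token in tokens:
        channels[0].put(token)
    channels[0].put(STOP)

    results = []
    while True:
        token = channels[-1].get()
        if token is STOP:
            break
        results.append(token)
    for t in threads:
        t.join()
    return results


# The same "workflow" under both directors.
stages = [lambda x: x + 1, lambda x: x * 10]
print(sequential_director(stages, [1, 2, 3]))
print(process_network_director(stages, [1, 2, 3]))
```

The stages are untouched when the director changes; only the scheduling and communication policy differs.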


Vergil is the GUI for Kepler

• Actor ontology and semantic search for actors
• Search -> Drag and drop -> Link via ports
• Metadata-based search for datasets

Actor Search Data Search


Actor Search

• Kepler Actor Ontology
  • Used in searching actors and creating conceptual views (= folders)

Currently more than 200 Kepler actors added!


Data Search and Usage of Results

• EarthGrid
– Discovery of data resources through local and remote services
  SRB, Grid and Web Services, Db connections
– Registry of datasets on the fly using workflows


Provenance of Workflow Related Data

• Provenance: A concept from art history and library
– Inputs, outputs, intermediate results, workflow design, workflow run
• Collected information
– Can be used in a number of ways
  • Validation, reproducibility, fault tolerance, etc…
– Linked to the data resources
– Viewable and searchable from outside Kepler
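One way to picture what gets collected is a small recorder that wraps each step of a run and logs its inputs, output, and timing, then serializes the record so it can be inspected outside the workflow tool. The Python sketch below is an assumption-laden toy: the class name, record fields, and JSON layout are invented for illustration and are not the Kepler provenance schema.

```python
import json
import time
import uuid


class ProvenanceRecorder:
    """Toy recorder: one dictionary per workflow run, one entry per step."""

    def __init__(self, workflow_name):
        self.run = {
            "run_id": str(uuid.uuid4()),
            "workflow": workflow_name,
            "steps": [],
        }

    def record(self, step_name, func, *inputs):
        """Execute one step and log its inputs, output, and duration."""
        started = time.time()
        output = func(*inputs)
        self.run["steps"].append({
            "step": step_name,
            "inputs": list(inputs),
            "output": output,
            "started": started,
            "seconds": time.time() - started,
        })
        return output

    def to_json(self):
        # A plain-text form that can be stored and searched elsewhere.
        return json.dumps(self.run, indent=2)


recorder = ProvenanceRecorder("toy-workflow")
total = recorder.record("sum", lambda *xs: sum(xs), 1, 2, 3)
mean = recorder.record("mean", lambda s, n: s / n, total, 3)
print(recorder.to_json())
```

Because the intermediate result `total` appears both as an output of one step and an input of the next, the record is enough to trace how the final value was derived.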


Running Provenance Recorder


Provenance Schema


Current Advances and Users

• Kepler Users:
– User interface users
  • Workflow developers
  • Scientists
  • Software Developers
  • Engineers
  • Researchers
– Batch users
  • Portals
  • Other workflow systems as an engine

• Collection-oriented workflows
• Domain-specific actors
• Provenance framework
• Semantics support
  • annotation, search, workflow validation, integration
• Data and Actor search
  • EarthGrid data access system
  • Kepler Component Library
• Kepler Archive (KAR) format
  • Integrated support for LSID identifiers for all objects
• Object Manager and cache
• Web service execution
• RExpression & MatlabExpression
• Redesigned user interface
• Authentication subsystem
• Null-value handling
• Documentation
• Distributed computing support
  • NIMROD, Globus, ssh, Master/Slave…


Kepler System Architecture

Diagram components:

• Authentication
• GUI
• Vergil
• SMS
• Kepler Core Extensions
• Ptolemy
• … Kepler GUI Extensions …
• Actor & Data Search
• Type System Ext
• Provenance Framework
• Kepler Object Manager
• Documentation
• Smart Re-run / Failure Recovery


Kepler can be used as a batch execution engine

• Configuration phase
• Subset: DB2 query on DataStar
• Interpolate: Grass RST, Grass IDW, GMT…
• Visualize: Global Mapper, FlederMaus, ArcIMS

Portal

Grid

Subset

Analyze

move process

Visualize

move render display

Scheduling/

Output Processing

Monitoring/Translation


Kepler can be executed as a Web Service!

• Invocation
– SOAP-based
– RESTful
• Executes workflow without the graphical user interface
• Outputs
– Files and text matching the display-related actors of the executed workflow.


What have we learned?

Scientific workflow systems as a research area…

Why workflows? What’s next?


Advantages of Scientific Workflow Systems

• Formalization of the scientific process
• Easy to share, adapt and reuse
– Deployable, customizable, extensible
• Management of complexity and usability
– Support for hierarchical composition
– Interfaces to different technologies from a unified interface
– Can be annotated with domain-knowledge
• Tracking provenance of the data and processes
– Keep the association of results to processes
– Make it easier to validate/regenerate results and processes
– Enable comparison between different workflow versions
• Execution monitoring and fault tolerance
• Interaction with multiple tools and resources at once


Evolving Challenges For Scientific Workflows

• Access to heterogeneous data and computational resources and link to different domain knowledge
• Interface to multiple analysis tools and workflow systems
– One size doesn’t fit all!
• Support computational experiment creation, execution, sharing, reuse and provenance
– Manage complexity, user and process interactivity
– Extensions for adaptive and dynamic workflows
• Track provenance of workflow design (= evolution), execution, and intermediate and final results
• Efficient failure recovery and smart re-runs
• Support various file and process transport mechanisms
– Main memory, Java shared file system, …


Evolving Challenges For Scientific Workflows

• Support the full scientific process
– Use and control instruments, networks and observatories in observing steps
– Scientifically and statistically analyze and control the data collected by the observing steps
– Set up simulations as testbeds for possible observatories
• Come up with efficient and intuitive workflow deployment methods
• Do all these in a secure and easy-to-use way!!!


So, show me an example CI project that uses Kepler?


CI Project: REAP

• Funded 2006-2009
• NSF CEO:P
  • Jones, Altintas, Baru, Ludaescher, Schildhauer
• Partners:
  • UCSB, SDSC/UCSD, UCDavis, UCLA, OpenDAP, OSU
  • Lead institution: NCEAS/UCSB
• Management and Analysis of Observatory Data using Kepler Scientific Workflows
• The vision:
  • An integrated environment for analyzing data from observatories

http://reap.ecoinformatics.org/


An End-to-End CI for Observatories

• Overall goal: To bring together, for the first time, seamless access to sensor data from real-time data grids with analytical tools and sophisticated modeling capabilities of scientific workflow environments
• Scientist’s view:
1. Access remote real-time and archived data streams
  • as if they were locally generated!
2. Design and execute scientific workflows
  • Processing steps: scientific models
  • Data: raw or derived from sensor networks and data archives
3. Combine data streams in hybrid analytical models
• System Engineer’s view:
1. View and monitor observatory infrastructure components
2. Model the impacts of system changes before they are executed
3. Modify the configuration of the observatory sensors and network


CI Project: CAMERA


Current Usecase Focus

• Alpha and Gamma Diversity Workflows
– With Forrest Rohwer Lab
– Status: Final version executable through CAMERA 2.0
• UCSD Annotation Pipeline (RAMMCAP)
– With Weizhong Li
– Status: First version executable through CAMERA 2.0
• Phylogenomic and Phylometagenomic Workflow
– With Jonathan Eisen Lab
– Status: Design in process


Alpha Diversity Workflow

Fasta File

Circonspect

Average Genome Size

Combine Results PHACCS

User can customize parameters

PHACCS steps in workflow
1. Submit job & get job id
2. Check job status
3. Fetch & display results
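The three PHACCS steps follow a common submit, poll, fetch pattern. The Python sketch below shows that pattern against an in-memory stand-in service; `FakeJobService`, its methods, and the status strings are hypothetical and say nothing about the real PHACCS interface. A real client would also wait between polls.

```python
import itertools


class FakeJobService:
    """In-memory stand-in for a remote job service (hypothetical)."""

    def __init__(self, polls_until_done=2):
        self._ids = itertools.count(1)
        self._polls_left = {}
        self._polls_until_done = polls_until_done

    def submit(self, payload):
        job_id = next(self._ids)
        self._polls_left[job_id] = self._polls_until_done
        return job_id

    def status(self, job_id):
        if self._polls_left[job_id] > 0:
            self._polls_left[job_id] -= 1
            return "RUNNING"
        return "DONE"

    def fetch(self, job_id):
        return f"results for job {job_id}"


def run_job(service, payload, max_polls=10):
    job_id = service.submit(payload)              # 1. submit job & get job id
    for _ in range(max_polls):                    # 2. check job status
        if service.status(job_id) == "DONE":
            return service.fetch(job_id)          # 3. fetch results
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} polls")


# 'sample.fasta' is a placeholder payload name for the example.
print(run_job(FakeJobService(), "sample.fasta"))
```

Bounding the number of polls keeps a stuck job from blocking the rest of the workflow indefinitely.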


CAMERA uses results from a 2008 MURPA project!

• Transfers images and displays them on OptiPortal using Kepler workflows
– by Hoang Nguyen and David Abramson
– Socket connection to OptiPortal to transfer CGLX server files
• Used in CAMERA project already!


To Sum Up

• … is an open-source system and collaboration
• was initiated in August, 2003
• grows by application pull from contributors
• released 1.0.0 on May 13th, 2008

• There is a lot more to cover and work on…

• More information: http://kepler-project.org


Ilkay [email protected]
+1 (858) 822-5453
http://www.sdsc.edu

Thanks! & Questions…


SWF Systems Requirements

• Design tools -- especially for non-expert users
• Ease of use -- fairly simple user interface having more complex features hidden in the background
• Reusable generic features
– Generic enough to serve different communities but specific enough to serve one domain (e.g. geosciences) => customizable
• Extensibility for the expert user
• Registration, publication & provenance of data products and “process products” (=workflows)
• Dynamic plug-in of data and processes from registries/repositories
• Distributed WF execution (e.g. Web and Grid awareness)
• Semantics awareness
• WF Deployment
– as a web site, as a web service, “Power apps” (a la SciRUN II)
• Interoperability with other SWF systems