COMAD: Collection-Oriented Modeling and Design of Scientific Workflows and Data · 2018-04-03 ·...

COMAD: Collection-Oriented Modeling and Design of Scientific Workflows and Data Timothy McPhillips, Shawn Bowers and Bertram Ludäscher UC Davis Genome Center 7th Biennial Ptolemy Miniconference Berkeley, CA February 13, 2007


Page 1

COMAD: Collection-Oriented Modeling and Design of Scientific Workflows and Data

Timothy McPhillips, Shawn Bowers and Bertram Ludäscher

UC Davis Genome Center

7th Biennial Ptolemy Miniconference

Berkeley, CA

February 13, 2007

Page 2

Why are scientific workflows often so complex and difficult to comprehend?

• Scientific workflows are meant not simply to automate research processes, but also to model them, i.e., make clear what scientific tasks are being automated.

• Many workflows are fairly impenetrable from this point of view.

• Many actors do not represent scientifically meaningful tasks.

• Many actor connections similarly lack scientific significance.

• Structure of the data is modeled implicitly in the workflow definition.

Workflows operating on nested collections of data can be particularly challenging to model.

Page 3

Nested data collections are everywhere

• Hierarchical structure of nature.

• Scientists (like the rest of us) tend to organize their projects and data in nested folders.

• Many scientific computations create lists of related results.

• Collections signify meaningful associations between data.

Conventional scientific workflow frameworks make it difficult to exploit--or even maintain--such natural organizations of data.

Page 4

Conventional actors do not maintain associations within data sets

[Figure: workflow diagram with nodes labeled A, B, C, D, and E and data tokens labeled a5 to a8, b5 to b8, c5 to c8, d1 to d4, and e1 to e4.]

• Actors generally only accept tokens of particular types.

• Tokens typically are consumed and new ones produced on each actor invocation.

• Many ports, wires, and data management actors are required to maintain associations within data sets.

• Iteration over subsets of the incoming data stream requires additional actors and constructs.

Page 5

Collection-oriented actors ("coactors") maintain data integrity by operating like assembly line workers

[Figure: data tokens a1, a2, a3, b1, c1, d1, d2, and f1 in a stream labeled In and Out.]

• A collection-oriented actor selects relevant data from its input.

• Analogous to an assembly line worker operating on a part of the product being assembled and ignoring the rest.

• Input data and intermediate data products are retained within collections and associated with downstream data products automatically.

• Coactors may be chained together in an intuitive order.
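The assembly-line behavior described in these bullets can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the COMAD or Kepler API: the `Token` class, the `coactor` function, and the type labels ("Sequence", "Length", "Doubled") are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Iterable, Iterator


@dataclass(frozen=True)
class Token:
    kind: str   # hypothetical type label, e.g. "Sequence"
    value: Any


def coactor(stream: Iterable[Token], selects: str,
            compute: Callable[[Any], Any], out_kind: str) -> Iterator[Token]:
    """Pass every token through; add a derived token after each selected one."""
    for token in stream:
        yield token                      # inputs are retained in the stream
        if token.kind == selects:        # only the relevant data is worked on
            yield Token(out_kind, compute(token.value))


stream = [Token("Sequence", "ACGT"), Token("Note", "lab 3"), Token("Sequence", "GG")]

# Two coactors chained in the order in which the work happens.
lengths = coactor(stream, "Sequence", len, "Length")
result = list(coactor(lengths, "Length", lambda n: n * 2, "Doubled"))
kinds = [t.kind for t in result]
```

Each stage ignores tokens it does not select (the "Note" token passes through untouched), and every derived token sits next to the data it was computed from, which is the association the slide says is maintained automatically.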

Page 6

Collection-oriented workflows explicitly group data tokens in the data flow

[Figure: flow of tokens in conventional workflows compared with flow of tokens in collection-oriented workflows; the collection-oriented stream is marked with a collection opening delimiter and a collection closing delimiter.]

Page 7

Collections may contain data tokens, metadata tokens, and other collections

[Figure: token stream with labels: opening delimiter for top-level collection a; metadata for collection a; opening delimiter for nested collection b; metadata for data token d2; closing delimiter for nested collection b; opening delimiter for top-level collection c.]

• Paired delimiter tokens allow collections to be nested.

• Metadata tokens may be used to annotate entire collections or individual data tokens (e.g., for recording provenance).
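As a rough illustration of how paired delimiter tokens encode nesting, the sketch below rebuilds a tree of collections from a flat token stream. The class names (`Open`, `Close`, `Meta`, `Data`, `Collection`), the `build_tree` function, and the synthetic "root" container are assumptions made for this example. For simplicity it attaches every metadata token to the enclosing collection, whereas the slide notes that metadata may also annotate individual data tokens.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Open:
    name: str


@dataclass
class Close:
    name: str


@dataclass
class Meta:
    key: str
    value: str


@dataclass
class Data:
    value: Any


@dataclass
class Collection:
    name: str
    metadata: dict = field(default_factory=dict)
    children: list = field(default_factory=list)


def build_tree(tokens):
    """Rebuild nested collections from a flat stream with paired delimiters."""
    root = Collection("root")   # hypothetical holder for top-level collections
    stack = [root]
    for tok in tokens:
        if isinstance(tok, Open):
            child = Collection(tok.name)
            stack[-1].children.append(child)
            stack.append(child)
        elif isinstance(tok, Close):
            if len(stack) == 1 or stack[-1].name != tok.name:
                raise ValueError(f"unmatched closing delimiter: {tok.name}")
            stack.pop()
        elif isinstance(tok, Meta):
            stack[-1].metadata[tok.key] = tok.value
        else:
            stack[-1].children.append(tok.value)
    if len(stack) != 1:
        raise ValueError(f"unclosed collection: {stack[-1].name}")
    return root


stream = [Open("a"), Meta("owner", "lab"), Data("d1"),
          Open("b"), Data("d2"), Close("b"), Close("a"),
          Open("c"), Close("c")]
tree = build_tree(stream)
```

Here collection b ends up nested inside a, and c is a second top-level collection, mirroring the a, b, c arrangement named in the figure labels.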

Page 8

Collection-oriented workflows automatically pipeline execution of coactors with different scopes

[Figure labels: Actor 5 processes one data token at a time. Actor 4 processes entire collections (of a particular type) at one time. Actors 4 and 5 are processing the contents of collection a at the same time.]

• Coactors may process one token at a time or operate on entire collections at once.

• Coactors declare what types of collections and data they process via scope expressions.

• Pipelining is automatic even when adjacent actors operate on different data chunk sizes.

• Concurrency is safe.
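The difference between a token-scoped and a collection-scoped coactor can be sketched with Python generators. This is an illustrative assumption, not the framework's scope-expression syntax: tokens are plain `(kind, payload)` tuples, `per_token` and `per_collection` are hypothetical names, and lazy generators stand in for the director's scheduling. The collection-scoped stage buffers only the collection it is working on, so the stages still pipeline.

```python
def per_token(stream, fn):
    """Token scope: transform each data token as soon as it arrives."""
    for kind, payload in stream:
        yield (kind, fn(payload)) if kind == "data" else (kind, payload)


def per_collection(stream, scope, summarize):
    """Collection scope: act once per collection named `scope`.

    Simplification: same-named collections nested inside each other
    are not handled.
    """
    buffer = None
    for kind, payload in stream:
        if kind == "open" and payload == scope:
            buffer = []                           # start collecting
        elif kind == "data" and buffer is not None:
            buffer.append(payload)
        elif kind == "close" and payload == scope and buffer is not None:
            yield ("data", summarize(buffer))     # result joins the collection
            buffer = None
        yield (kind, payload)                     # everything is passed along


stream = [("open", "a"), ("data", 1), ("data", 2), ("close", "a"),
          ("open", "b"), ("data", 10), ("close", "b")]

incremented = per_token(stream, lambda x: x + 1)
output = list(per_collection(incremented, "a", sum))
```

The sum is emitted just before the closing delimiter of collection a, so it stays grouped with the tokens it summarizes, while collection b passes through the second stage untouched.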

Page 9

A simple collection-oriented workflow

[Figure labels: Lists Nexus files to process. Reads text files. Parses Nexus format. Infers maximum likelihood trees from DNA sequences. Draws phylogenetic trees. Tree drawn by PhylipDrawgram.]

Note the simple layout and intuitive order of actors in this workflow.

Page 10

Nested collections displayed as XML

[Figure labels: Project collection opening delimiter; Nexus collection opening delimiter; metadata token; domain-specific data tokens; Nexus collection closing delimiter; Project collection closing delimiter.]
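The XML shown on the original slide is not recoverable from this transcript. As a stand-in, the sketch below shows how a Project collection containing a Nexus collection might map onto nested XML elements, using Python's standard `xml.etree.ElementTree`. The element names, attribute names, and values are assumptions chosen for illustration, not the schema used by the authors.

```python
import xml.etree.ElementTree as ET

# Opening and closing tags play the role of the paired collection delimiters.
project = ET.Element("Project")
nexus = ET.SubElement(project, "Nexus")

# A metadata token followed by domain-specific data tokens (all hypothetical).
ET.SubElement(nexus, "Metadata", key="source").text = "example.nex"
ET.SubElement(nexus, "Sequence", taxon="A").text = "ACGT"
ET.SubElement(nexus, "Sequence", taxon="B").text = "ACGA"

xml_text = ET.tostring(project, encoding="unicode")
```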

Page 11

Iterative execution of coactors

Composite workflow runs PhylipPars iteratively to discover all of the most parsimonious trees.

UniqueTrees discards redundant trees in each collection.

PhylipConsense computes the consensus of all trees discovered by PhylipPars.

PhylipPars searches for most parsimonious trees using a heuristic employing a random number seed.

Page 12

Iterating over parameter values and accumulating the results

The default value for the PhylipPars jumbleSeed parameter

This actor adds a parameter token named jumbleSeed with a value of 5 to each incoming Nexus collection.

The jumbleSeed parameter token overrides the default value of the parameter of the same name in PhylipPars.

This actor increments the jumbleSeed parameter so that PhylipPars will discover different trees on each pass.

EndLoop is configured to allow each Nexus collection to exit the loop when a minimum number of equally parsimonious trees has been discovered or a maximum number of iterations has occurred.
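The loop described on this slide can be sketched as plain Python. This is a minimal sketch under stated assumptions: `fake_pars` is a deterministic stand-in for PhylipPars (the real program is not invoked), the increment `step`, the thresholds, and the tree names are hypothetical, and a Python set plays the part of discarding redundant trees. Only the starting value of 5 for `jumbleSeed` and the two exit conditions come from the slide.

```python
def fake_pars(jumble_seed):
    """Stand-in for PhylipPars: the trees returned depend on the seed."""
    return {f"tree{jumble_seed % 3}", f"tree{(jumble_seed * 2) % 7}"}


def search(initial_seed=5, step=1, min_trees=4, max_iterations=10):
    """Repeat the search with new seeds until an exit condition is met."""
    trees = set()                 # a set keeps only distinct trees
    seed = initial_seed           # parameter token overriding the default
    for iteration in range(1, max_iterations + 1):
        trees |= fake_pars(seed)
        if len(trees) >= min_trees:   # exit: enough trees discovered
            break
        seed += step                  # next pass explores a different seed
    return trees, iteration           # leaving the for loop: iteration limit hit


trees, iterations = search()
capped_trees, capped_iterations = search(min_trees=100, max_iterations=3)
```

With the stand-in function, the first call stops after two passes because four distinct trees have been found, while the second call runs into the iteration limit instead.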

Page 13

Collection-oriented actors can be constructed from conventional actors

• Composites shown are collection-oriented actors wrapping conventional sub-workflows.

• A special composite actor is used to perform this wrapping.

• Simplifies design and implementation of new coactors while leveraging existing conventional actors.

• Combines the best of conventional and collection-oriented approaches.

Page 14

Composite coactor for composing Nexus files

Composite coactors use the SDF director.

Workflow parameters are exposed as actor parameters in the containing workflow and may be overridden dynamically.

This is a conventional actor, i.e., it is collection-unaware.

The names of input ports specify the types and quantity of data to be selected from passing data collections.

The name of the output port specifies the type and quantity of data to be added to the data stream.

Page 15

Composite coactor for running Phylip Consense

• Composite coactors provide a safe context for running external programs, catching exceptions, and cleaning up temporary files.

• The environment creation/destruction actors are generic.

Creates a temporary directory for running Consense

Prepares input file required by Consense

Runs Consense in the temporary directory

Reads and parses the file output by Consense

Deletes the temporary directory and files.
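The five steps above (create a temporary directory, prepare the input, run the program, parse the output, clean up) follow a pattern that can be sketched with Python's standard library. This is not the composite coactor itself: `run_in_scratch_dir` is a hypothetical helper, the file names `intree` and `outtree` are assumed for illustration, and a small Python one-liner stands in for the external Consense program.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_scratch_dir(command, input_name, input_text, output_name):
    """Run an external command in a temporary directory and return its output file."""
    # The directory and its files are deleted on exit, even if the command fails.
    with tempfile.TemporaryDirectory() as workdir:
        work = Path(workdir)
        (work / input_name).write_text(input_text)       # prepare the input file
        subprocess.run(command, cwd=work, check=True)    # raises on a non-zero exit
        return (work / output_name).read_text()          # read the output file


# Stand-in for the external program: copies the input to the output in upper case.
stand_in = [sys.executable, "-c",
            "open('outtree', 'w').write(open('intree').read().upper())"]

result = run_in_scratch_dir(stand_in, "intree", "(a,b);", "outtree")
```

Because creating and removing the working directory is independent of the program being run, the same helper could wrap other command-line tools, which matches the slide's point that the environment creation and destruction actors are generic.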

Page 16

Understanding how results are calculated can be challenging even for simple workflows

Page 17

Provenance records in a workflow trace

Page 18

Data lineage graph for the consensus tree

[Figure: lineage graph with edges labeled by actor invocations, including TextFileReader:1, NexusFileParser:1, PhylipPars:1, PhylipPars:3, PhylipPars:5, and PhylipConsense:1.]

• Derivation of a data item in a scientific workflow run may be viewed as a directed acyclic graph.

• Nodes represent data the workflow run operated on or created.

• Each edge points from one data item to a second data item that was directly used in the computation of the first.

• An edge is labeled by the actor invocation that performed this computation.
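A lineage graph of this kind can be represented as a mapping from each derived item to the items it was computed from, with the actor invocation as the edge label. The sketch below is an assumption-laden illustration: the data item names (`consensus_tree`, `tree1`, `alignment`, and so on) and the shape of the graph are invented, and only the invocation labels are taken from the figure.

```python
# Each entry maps a derived item to (actor invocation, item used) edges.
lineage = {
    "consensus_tree": [("PhylipConsense:1", "tree1"), ("PhylipConsense:1", "tree2")],
    "tree1": [("PhylipPars:1", "alignment")],
    "tree2": [("PhylipPars:3", "alignment")],
    "alignment": [("NexusFileParser:1", "nexus_text")],
    "nexus_text": [("TextFileReader:1", "file_name")],
}


def ancestors(item, graph):
    """Return every data item that contributed, directly or not, to `item`."""
    seen = []
    stack = [item]
    while stack:
        current = stack.pop()
        for _invocation, used in graph.get(current, []):
            if used not in seen:      # shared inputs are reported once
                seen.append(used)
                stack.append(used)
    return seen


def invocations(item, graph):
    """Return the actor invocations on the edges leaving `item`."""
    return sorted({invocation for invocation, _ in graph.get(item, [])})


contributors = ancestors("consensus_tree", lineage)
direct = invocations("consensus_tree", lineage)
```

Following the edges from the consensus tree answers the question of how a result was calculated: which earlier data items it depends on and which actor invocations produced each step.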

Page 19

Summary

• The collection-oriented approach simplifies the design and maintenance of complex scientific workflows operating on nested data sets.

• Composite coactors simplify implementation of collection-oriented actors, leverage existing conventional actors, and facilitate modularity and actor reuse.

• Collection-oriented workflows can easily record the detailed provenance of intermediate and final workflow products.

Page 20

Science Environment for Ecological Knowledge

Real-time Observatories Applications and Data Management Network

Natural Diversity Discovery Project

Cyberinfrastructure for the Geosciences

SDM Center/Scientific Process Automation

Bertram Ludäscher, Shawn Bowers, Tim McPhillips, Norbert Podhorski

Scientific data management, data integration, scientific workflows, data provenance, & collaboration with domain scientists

Center for Plasma Edge Simulation

Cyberinfrastructure for Phylogenetics Research

SEEK is supported by NSF ITR 022567. pPOD is supported by NSF IIS 0629846, IIS 0630033, and IIS 0629702.

Daniel Zinn, Alex Chen, Dave Thau, YOUR NAME HERE?

Open positions for postdocs and software engineers