Automatic Generation of Workflow Execution Provenance Roger S. Barga Database Group, Microsoft...


Page 1: Automatic Generation of Workflow Execution Provenance Roger S. Barga Database Group, Microsoft Research (MSR)

Automatic Generation of Workflow Execution Provenance
Roger S. Barga
Database Group, Microsoft Research (MSR)

Page 2

My interest in scientific workflow and provenance…

In a previous life…

Research Scientist, PNNL, DOE National Laboratory

• Machine learning, pattern recognition over large data sets
• Scientific experiment management system (EMSL)
• Electronic laboratory notebook for experiment capture

More recently…

Database Group, Microsoft Research in Redmond, WA

ImmortalDB (ICDE’06, SIGMOD’06), Event Processing, Phoenix

• Extend commercial software to support scientific research

Tailor software for the sciences, provide free of charge
Serve as a positive force in the community (Tony Hey)

Practical value, challenging information management research issues…

Page 3

Objectives for this initial effort

Provenance capture that is automatic & transparent

Should persist provenance data for a fixed period of time

Support multiple levels of representation
• WF description, logical log (o & p), deviations, step-by-step trace

Version and lock the executables

Efficient representation and management
• Opportunity to significantly reduce execution provenance storage costs

An enactment engine for scientific workflows that documents all steps linking original inputs with final results, so an experiment (execution) can be verified, reproduced or rerun

Page 4

Issues NOT considered in our initial effort

Annotations and provenance of the workflow

How to include external provenance

Evaluate our prototype on actual scientific workflows

Provide query and analysis support over execution provenance traces…

Focus on mechanism, implement something simple but useful, consider how to manage this virtual data product

Provenance capture that is automatic & transparent

Support multiple levels of representation

Version and lock the executables

Efficient representation and management

Page 5

Types of Provenance to Capture in Workflow Execution

Experiment Design
• Serialize the workflow schedule (XOML)

Invocation Record
• Invocation of specific activities, events and rules
• Deviations from the defined schedule (shims, etc)

Interaction Provenance
• Input variables, runtime parameters, activation inputs
• External services invoked, return value(s), etc

Job Provenance
• Start/complete time, etc

A workflow schedule: sequential, event, rule driven

An Activity: What about internal state?
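One way to picture the four provenance types above is as a small set of record types. The following is only a sketch in Python, not the prototype's actual schema: every class and field name here is an assumption made for illustration, and the experiment design is held as an opaque string standing in for the serialized XOML.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Dict, List, Optional


@dataclass
class InvocationRecord:
    """Invocation of one activity, event or rule (hypothetical layout)."""
    activity: str
    kind: str = "activity"            # "activity", "event" or "rule"
    deviation: Optional[str] = None   # e.g. "inserted shim"; None if on schedule


@dataclass
class InteractionRecord:
    """Inputs and external calls observed for one activity."""
    activity: str
    inputs: Dict[str, Any] = field(default_factory=dict)
    external_service: Optional[str] = None
    return_value: Any = None


@dataclass
class JobRecord:
    """Job-level provenance such as start and completion time."""
    started: datetime
    completed: Optional[datetime] = None


@dataclass
class ExecutionProvenance:
    """One workflow execution, grouping the four provenance types."""
    experiment_design: str            # serialized workflow schedule (opaque text)
    invocations: List[InvocationRecord] = field(default_factory=list)
    interactions: List[InteractionRecord] = field(default_factory=list)
    job: Optional[JobRecord] = None

    def deviations(self) -> List[InvocationRecord]:
        """Return only the invocations that departed from the schedule."""
        return [r for r in self.invocations if r.deviation is not None]
```

Keeping deviations as a field on the invocation record, rather than as a separate list, is one possible design choice; it lets a step-by-step trace and a deviations-only view come from the same data.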

Page 6

Architecture Overview

[Architecture diagram. Component labels:]
• Problem Solving Environment
• Workflow Enactment Engine (WinWF)
• Logical Logging Utility
• Provenance Storage Service Interface (PSI)
• Workflow Execution Provenance Storage Service (built using CLFS)
• Query and Management Interface (QMI)
• Client Query Library
• Management Routines
• Provenance Services: trace execution, difference analysis, reload runtime state, …
• HPC Job Scheduler

[Calls shown in the diagram:]
• CreateJOB(XOML)
• ExecuteTask(JID, Act)

Page 7

Implementation – extending base activity classes

Activities are the basic building blocks
• They are the unit of execution, re-use and composition
• The root of the entire workflow is itself an activity
• Composite activities contain other activities
  E.g.: Sequence, Parallel, Synchronize, Exclusive Choice, Merge, …
• Basic activities are steps within a workflow

Activities are simply classes
• Properties and events are introduced to intercept and pass control to the provenance capture service at runtime…
• Each class defines provenance persistence methods that are invoked by the workflow runtime
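The idea of a base activity class that hands control to a capture service can be sketched as follows. This is Python rather than the .NET classes the prototype extends, and `ProvenanceCapture`, `Activity`, `Sequence` and `Invoke` are hypothetical names chosen for this illustration, not Windows Workflow Foundation types.

```python
class ProvenanceCapture:
    """Hypothetical stand-in for the provenance capture service."""

    def __init__(self):
        self.records = []

    def record(self, activity_name, event):
        self.records.append((activity_name, event))


class Activity:
    """Base class: every activity reports to the capture service."""

    def __init__(self, name, capture):
        self.name = name
        self.capture = capture

    def execute(self):
        # Interception point: the base class, not the activity author,
        # passes control to the capture service around the real work.
        self.capture.record(self.name, "started")
        result = self.run()
        self.capture.record(self.name, "completed")
        return result

    def run(self):
        raise NotImplementedError


class Sequence(Activity):
    """Composite activity: runs its children in order."""

    def __init__(self, name, capture, children):
        super().__init__(name, capture)
        self.children = children

    def run(self):
        return [child.execute() for child in self.children]


class Invoke(Activity):
    """Basic activity: a single step that calls a function."""

    def __init__(self, name, capture, func):
        super().__init__(name, capture)
        self.func = func

    def run(self):
        return self.func()
```

Because the capture calls live in `Activity.execute`, a workflow author who only overrides `run` gets provenance records without writing any logging code, which is the "automatic and transparent" property the slides aim for.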

Page 8

Workflow Execution

[Execution diagram. Labels: My Experiment; Instance Manager; Scheduler; WF1; WF1 Instance; SequentialWorkflow; Sequence; OnEvent1; Invoke1; MyWF.dll; Base Activity Library; Runtime Engine; Runtime Services. Actions shown: Create instance; Execute; Execute until idle; Save; Persist Provenance; Persist provenance to disk.]

rt.StartWorkflow(typeof(WF1));

1. App calls StartWorkflow(…)
2. Instance Manager:
   • Loads workflow type
   • Creates instance
   • Enqueues WF1 with Scheduler
3. Scheduler dequeues WF1, serializes XOML, calls Executor (SequentialWorkflow base) which enqueues Sequence Activity
4. Dequeue Sequence & call Executor, which serializes ActRec and enqueues OnEvent1
5. Dequeue OnEvent1, serialize ActRec and call Executor, which subscribes to event
6. InstanceMgr calls Flush() on WF1 (Activity base class) to flush provenance records and gets back stream
7. Instance Mgr calls Provenance service passing serialized stream; Provenance Storage service saves to disk
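The dequeue, serialize, enqueue and flush cycle on this slide can be mimicked with a plain queue. The sketch below is an assumption-laden Python illustration, not the WinWF scheduler: the workflow is a nested dictionary, an "ActRec" is a small dictionary, and `save` is any callable standing in for the Provenance Storage service.

```python
import json
from collections import deque


def run_until_idle(workflow, save):
    """Drain the scheduler queue, recording one ActRec per activity.

    workflow: {"name": str, "children": [workflow, ...]} (hypothetical shape)
    save:     callable that receives the serialized provenance stream
    """
    queue = deque([workflow])          # the instance is enqueued with the scheduler
    records = []
    while queue:                       # execute until idle
        activity = queue.popleft()
        # Serialize an activity record (ActRec) before handing off to children.
        records.append({"seq": len(records), "activity": activity["name"]})
        queue.extend(activity.get("children", []))
    stream = json.dumps(records)       # counterpart of Flush(): return a stream
    save(stream)                       # storage service persists the stream
    return stream
```

A usage example under the same assumptions: calling `run_until_idle` with a `WF1` that contains `Sequence`, which contains `OnEvent1`, and with `saved.append` as the storage callable, leaves one JSON stream in `saved` listing the three activities in dequeue order.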

Page 9

Transparent Interception and Logical Logging

[Diagram: a SEQUENCE Activity containing Workflow Activity 1 … Workflow Activity N]

Each activity is creating an operation history – a time serial stream of provenance records.

Each record represents a change in operational state, such as sequence advancing, a synchronize or branch being taken, activities passing data via method calls.

Replay of the log is an accurate repeated history of state changes, up to and including the “present” state

Provenance Service “weaves” these records into the workflow XOML, recording LSNs for individual activities, insertions (shims), etc.
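The replay property described above can be illustrated with a minimal append-only log. This is a sketch under assumptions: the record layout (activity, field, value) and the `OperationLog` class are invented for the example, and log sequence numbers (LSNs) are simply consecutive integers starting at 1.

```python
class OperationLog:
    """Append-only operation history with log sequence numbers (LSNs)."""

    def __init__(self):
        self._records = []

    def append(self, activity, field, value):
        """Record one state change and return the LSN assigned to it."""
        lsn = len(self._records) + 1
        self._records.append((lsn, activity, field, value))
        return lsn

    def replay(self, up_to_lsn=None):
        """Rebuild state by reapplying records in LSN order.

        With up_to_lsn=None the result is the "present" state; with an
        LSN it is the state as of that record.
        """
        state = {}
        for lsn, activity, field, value in self._records:
            if up_to_lsn is not None and lsn > up_to_lsn:
                break
            state.setdefault(activity, {})[field] = value
        return state
```

Because each record describes a change rather than a full snapshot, the log stays small while any intermediate state remains recoverable by replaying a prefix.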

Page 10

Provenance Capture Integrated into Runtime Engine and Services

[Layer diagram. Labels: Host Process; My Experiment; Workflow Foundation]

• Base Activity Library: classes augmented with provenance capture
• Runtime Engine: provides intrinsic behaviors to activities
  Labels shown: Tracking Infrastructure, State Management, Workflow Execution, Provenance Management
• Runtime Services: hosting flexibility, pluggable implementations (with defaults)
  Labels shown: Provenance Storage (PSI), Communication, Tracking, …

Page 11

Query Support (initial)

Individual Workflow Execution Trace
• Display a graphical trace of the execution;
• Query for skipped steps, inserted steps, etc
• Query for the codes (activities) invoked.
• Query for machine execution stati

Multiple Workflow Execution Traces
• Comparative trace (shallow, versus deep compare)

Still “early days” for our query support over a workflow execution provenance trace store
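To make the query types above concrete, here is a small Python sketch over in-memory traces. It is not the prototype's query library. The trace format (a list of dictionaries with an "activity" key) is assumed, and so is the reading of "shallow" as comparing only which activities ran and "deep" as comparing every recorded field, since the slides do not define those terms.

```python
def skipped_steps(schedule, trace):
    """Scheduled activities that never appear in the trace."""
    executed = {record["activity"] for record in trace}
    return [step for step in schedule if step not in executed]


def inserted_steps(schedule, trace):
    """Executed activities (e.g. shims) that were not in the schedule."""
    planned = set(schedule)
    return [r["activity"] for r in trace if r["activity"] not in planned]


def shallow_equal(trace_a, trace_b):
    """Assumed 'shallow' compare: same activities in the same order."""
    return [r["activity"] for r in trace_a] == [r["activity"] for r in trace_b]


def deep_equal(trace_a, trace_b):
    """Assumed 'deep' compare: every recorded field must match."""
    return trace_a == trace_b
```

Under these assumptions, two runs of the same workflow on different machines would compare equal in the shallow sense but not in the deep sense.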

Page 12

An Issue to Consider…

It may not be possible to rerun an experiment, to either validate or recreate a result, because the original workflow is lost (activities have been updated).

Assign a version identifier (strong name) to the workflow assembly so it can be associated with the result; retain the versioned assembly only if the provenance is retained.

Updating any activity in the workflow will change this version number, resulting in a new version being created.

The user is able to rerun the experiment by invoking the workflow using the fully-specified reference found in the provenance record.
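The versioning rule (any activity update yields a new workflow version) can be sketched with a content hash. Note the substitution: this example uses a SHA-256 digest in place of a .NET strong name, purely to show the behavior, and the function name and input shape are assumptions.

```python
import hashlib


def workflow_version(activities):
    """Derive a version identifier from the code of every activity.

    activities: dict mapping activity name -> bytes of its code
    (hypothetical input shape). A content hash stands in here for the
    strong name assigned to the workflow assembly.
    """
    digest = hashlib.sha256()
    for name in sorted(activities):          # order-independent result
        digest.update(name.encode("utf-8"))
        digest.update(b"\0")                 # separator between name and code hash
        digest.update(hashlib.sha256(activities[name]).digest())
    return digest.hexdigest()
```

Storing this identifier in the provenance record, and keeping the matching code for as long as the record is kept, is what would let a later rerun locate exactly the workflow that produced a result.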

Page 13

To Sum Up…

Extended Windows Workflow Foundation
• Transparently capture execution trace leading to a result
• Towards a layered provenance model
• Initial query facility built over this provenance data

This summer, evaluation and necessary extensions, analysis support
• Luciano Digiampietri (UniCamp/Brazil), project intern

Tying provenance to code versioning
• In general, how to manage provenance data and code so the scientist simply doesn’t have to worry about it…

An interesting data management challenge
• Provenance as a first class derived data item

Page 14

Closing Comments…

Provenance presents many, many open questions, but offers so much potential…

Execution provenance (sadly) is just the tip…
• Is this even provenance – where to draw the line?
• Shall we revel in complexity, or focus on the low-hanging fruit? Can’t we do both?

Standards (agreements) on representation/protocols
• Try to reach a “tipping point”

Welcome your feedback, suggestions and open to opportunities to collaborate on this problem…

Page 15