Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations. IEEE BigData...

Post on 19-Jun-2015


Scientific workflows have become the workhorse of BigData analytics for scientists. As well as being repeatable and optimizable pipelines that bring together datasets and analysis tools, workflows make up an important part of the provenance of data generated from their execution. By faithfully capturing all stages in the analysis, workflows play a critical part in building up the audit-trail (a.k.a. provenance) metadata for derived datasets and contribute to the veracity of results. Provenance is essential for reporting results, reporting the method followed, and adapting to changes in the datasets or tools. These functions, however, are hampered by the complexity of workflows and, consequently, the complexity of the data trails generated from their instrumented execution. In this paper we propose the generation of workflow description summaries in order to tackle workflow complexity. We elaborate reduction primitives for summarizing workflows, and show how primitives, as building blocks, can be used in conjunction with semantic workflow annotations to encode different summarization strategies. We report on the effectiveness of the method through experimental evaluation using real-world workflows from the Taverna system.


Small Is Beautiful: Summarizing Scientific Workflows Using Semantic Annotations

Pinar Alper, Khalid Belhajjame, Carole A. Goble

University of Manchester

Pinar Karagoz

Middle East Technical University

IEEE 2nd International Congress on Big Data

June 27-July 2, 2013

Pinar and her daughter Nile at the end of year school party.

Scientific Workflows

• Data driven analysis pipelines

• Systematic gathering of data and analysis tools into computational solutions for scientific problem-solving

• Tools for automating frequently performed data intensive activities

• Provenance for the resulting datasets

– The method followed

– The resources used

– The datasets used

Science with workflows

GWAS, Pharmacogenomics: Association study of Nevirapine-induced skin rash in Thai Population

Trypanosomiasis (sleeping sickness parasite) in African Cattle

Astronomy & HelioPhysics

Library Doc Preservation

Systems Biology of Micro-Organisms

Observing Systems Simulation Experiments, JPL, NASA

BioDiversity Invasive Species Modelling

[Credit Carole A. Goble]

Provenance is paramount for science

• Reporting findings

– Derivation - how did we get this result?

• processes/programs used, execution trace, data lineage, source of components (data, services)

– History - who did what when?

• creator, contributors, timestamps.

• Adapting to Change

– Explanation - why did this record start to appear in the result?

– Change Impact - which steps will be affected if I change this tool or data input?

PROV Primer, Gil et al

WF Execution Trace - Retrospective Provenance: Actual data used, actual invocations, timestamps and data derivation trace

WF Description - Prospective Provenance: Intended method for analysis

Workflows can get complex!

• Overwhelming for users who are not the developers

• Abstractions required for reporting

• Lineage queries result in very long trails

Reason and extent of complexity

• a.k.a. Shims

• Dealing with data and protocol heterogeneities

• Local organization of data

Garijo D., Alper. P., Belhajjame K. et al

D. Hull et al

~ 60%

Static Ways To Tackle Complexity

Process-Wise and Data-Wise abstractions

• Sub-workflows

– Not always a significant unit of function (e.g. aesthetic purposes)

• Bookmarked data links

– Cluster the output signature

– Further complicates workflow

• Components

– Library dependent

Our Solution: Workflow Description Summaries

• A graph model for representing workflows

• Graph re-write rules for summarization

IF <performs certain function> THEN <re-write WF graph>

motifs reduction-primitives

• Domain Independent categorization

– Data-Oriented Nature

– Resource/Implementation-Oriented Nature

• Captured In a lightweight OWL Ontology

http://purl.org/net/wf-motifs

PART-1: Scientific Workflow Motifs

A graph model of data-driven workflows

Pure Dataflows

$W = \langle N, E \rangle$

Operation and Port Nodes

$N = N_{op} \cup N_{p}$

Dataflow edges

$E = E_{opp} \cup E_{pp} \cup E_{pop}$
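The graph model above can be sketched as a small data structure. This is only an illustration: the class name `Workflow`, its method names, and the node names in the usage lines are hypothetical. It also assumes the three edge sets stand for operation-to-port, port-to-port, and port-to-operation links, which the slide does not spell out.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    """W = <N, E>: operation and port nodes plus dataflow edges."""
    operations: set = field(default_factory=set)  # N_op
    ports: set = field(default_factory=set)       # N_p
    edges: set = field(default_factory=set)       # E, stored as (source, target)

    def nodes(self):
        # N is the union of operation nodes and port nodes.
        return self.operations | self.ports

    def add_edge(self, source, target):
        # Assumed reading of the edge sets: op->port, port->port, port->op.
        # A direct op->op link is therefore rejected.
        if source not in self.nodes() or target not in self.nodes():
            raise ValueError("both endpoints must be nodes of the workflow")
        if source in self.operations and target in self.operations:
            raise ValueError("operations are linked through ports")
        self.edges.add((source, target))


# Hypothetical two-step dataflow: fetch -> (port) -> (port) -> plot
wf = Workflow(operations={"fetch", "plot"}, ports={"fetch:out", "plot:in"})
wf.add_edge("fetch", "fetch:out")      # operation to port
wf.add_edge("fetch:out", "plot:in")    # port to port
wf.add_edge("plot:in", "plot")         # port to operation
```

Keeping ports as first-class nodes means data-level detail (what flows where) stays visible to any later re-writing step, rather than being hidden inside operation-to-operation links.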

Motif annotations over operations

motifs(color_pathway_by_objects) = {m1:DataRetrieval}

motifs(Get_Image_From_URL_2) = {m2:DataMoving}
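A minimal sketch of how such annotations might be held in memory, assuming a plain mapping from operation name to a set of motif names. The annotation identifiers `m1` and `m2` from the slide are dropped for brevity, and the helper `operations_with` is hypothetical.

```python
# Motif annotations over operations, taken from the two examples above.
motifs = {
    "color_pathway_by_objects": {"DataRetrieval"},
    "Get_Image_From_URL_2": {"DataMoving"},
}


def operations_with(motif, annotations):
    """Return, in sorted order, the operations annotated with the given motif."""
    return sorted(op for op, ms in annotations.items() if motif in ms)


print(operations_with("DataMoving", motifs))
```

Using a set per operation leaves room for an operation that carries more than one motif.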


PART-2: Workflow reduction primitives

• Collapse (Up/Down)

• Compose

• Eliminate

Collapse Down

Collapse Up

Compose

Eliminate
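The slides name the primitives without defining them, so the following is a sketch under a stated assumption: Eliminate removes an operation and links each of its predecessors directly to each of its successors. The graph is simplified to operation-level edges, and the function names and operation names are hypothetical. An acyclicity check is included because the re-writing is meant to preserve acyclicity.

```python
def eliminate(edges, op):
    """Remove `op` and bridge its predecessors to its successors (assumed semantics)."""
    preds = {s for s, t in edges if t == op}
    succs = {t for s, t in edges if s == op}
    kept = {(s, t) for s, t in edges if op not in (s, t)}
    bridged = {(p, s) for p in preds for s in succs}
    return kept | bridged


def is_acyclic(edges):
    """Kahn-style check: every node can be removed in topological order."""
    nodes = {n for e in edges for n in e}
    indegree = {n: 0 for n in nodes}
    for _, target in edges:
        indegree[target] += 1
    ready = [n for n in nodes if indegree[n] == 0]
    removed = 0
    while ready:
        node = ready.pop()
        removed += 1
        for source, target in edges:
            if source == node:
                indegree[target] -= 1
                if indegree[target] == 0:
                    ready.append(target)
    return removed == len(nodes)


# Hypothetical pipeline with a formatting step between retrieval and analysis.
pipeline = {("retrieve", "reformat"), ("reformat", "analyse")}
summary = eliminate(pipeline, "reformat")
```

Because only predecessor-to-successor links are added, any path in the summary corresponds to a path that already existed in the input, so a cycle-free input yields a cycle-free summary.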

How will rules be put to use

• Strategies as a set of rules for summarization

• Two sample strategies based on an empirical analysis of workflows

• Reporting:

– Process: Significant activities (Retrieval, Analysis, Visualization)

– Data:

• Reduced cardinality

• Stripped of protocol specific payload/formatting

Two sample strategies

• By-Eliminate

– Minimal annotation effort

– Single rule

• By-Collapse

– More specific annotation

– Multiple rules
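One way to picture a strategy is as an ordered list of IF-THEN rules, each pairing a condition over an operation's motifs with the name of a reduction primitive. The sketch below is illustrative only: the concrete rules, the motif names `DataAnalysis`, `DataVisualization` and `DataCleaning`, the collapse directions, and the function `plan` are assumptions, not the rules used in the paper.

```python
# Motifs treated as significant for reporting (assumed names).
SIGNIFICANT = {"DataRetrieval", "DataAnalysis", "DataVisualization"}

# By-Eliminate: a single rule over coarse annotations.
BY_ELIMINATE = [
    (lambda ms: not (ms & SIGNIFICANT), "eliminate"),
]

# By-Collapse: several rules that rely on more specific annotations.
# The directions chosen here are hypothetical.
BY_COLLAPSE = [
    (lambda ms: "DataMoving" in ms, "collapse_down"),
    (lambda ms: "DataCleaning" in ms, "collapse_up"),
]


def plan(strategy, annotations):
    """Map each matching operation to the primitive of the first rule that fires."""
    chosen = {}
    for op, ms in annotations.items():
        for condition, primitive in strategy:
            if condition(ms):
                chosen[op] = primitive
                break
    return chosen


annotations = {
    "color_pathway_by_objects": {"DataRetrieval"},
    "Get_Image_From_URL_2": {"DataMoving"},
}
print(plan(BY_ELIMINATE, annotations))
print(plan(BY_COLLAPSE, annotations))
```

Separating the plan (which primitive applies to which operation) from the graph re-writing itself keeps a strategy a purely declarative rule set, which matches the idea that users control summarization indirectly through rules.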

Overall Approach

Workflow Designer

Taverna Workbench

Motif Ontology

WF Summary

WF Description

Summarizer

Summarization Rules

Analysis Data Set

• 30 Workflows from the Taverna system

• Entire dataset & queries accessible from http://www.myexperiment.org/packs/467.html

• Manual Annotation using Motif Vocabulary

Summaries at a glance

By-Collapse

By-Elimination

• Causal Ordering of operations

• Reduced depth

By-Collapse

By-Elimination

Mechanistic Effect of Summarization

User Summaries vs. Summary Graphs

Related Work

• User Views over provenance, O. Biton et al.

– User specified significant operations

– Automatic partitioning of workflow graph

• Provenance Redaction, T. Cadenhead et al.

– Redaction primitives

– Graph queries with regular expressions

• Provenance Publishing, S. C. Dey et al.

– User policies on publishing (hide, retain)

– Consistency checks

Highlights

• Re-writing workflow graphs with rules

• Exploiting semantic annotations of operations

• Controlled, primitive-based re-writing

– Preserve acyclicity

• Users indirectly control the summarization

– Encoding their preferences as summary rules

• Querying of Workflow Execution Provenance using summaries.

Future Work

Thank you!

Carole A. GOBLE, University of Manchester

Khalid BELHAJJAME, University of Manchester

Pinar KARAGOZ, Middle East Technical University

Pinar ALPER, University of Manchester

Bibliography

D. Garijo, P. Alper, K. Belhajjame, O. Corcho, C. Goble, and Y. Gil. Common motifs in scientific workflows: An empirical analysis. In Proceedings of the IEEE eScience Conference, 2012.

P. Alper, K. Belhajjame, C. A. Goble, and P. Senkul. Enhancing and abstracting scientific workflow provenance for data publishing. Submitted for publication to BIGProv 2013 International Workshop on Managing and Querying Provenance Data at Scale, co-located with EDBT-2013.

O. Biton et al. Querying and Managing Provenance through User Views in Scientific Workflows. In 2008 IEEE 24th International Conference on Data Engineering, pages 1072–1081, Apr. 2008.

J. Cheney et al. Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474, 2009.

D. Hull et al. Treating shimantic web syndrome with ontologies. In AKT Workshop on Semantic Web Services, 2004.

S. C. Dey, D. Zinn, and B. Ludäscher. Propub: towards a declarative approach for publishing customized, policy-aware provenance. In Proceedings of the 23rd international conference on Scientific and statistical database management, SSDBM’11, pages 225–243, Berlin, Heidelberg, 2011. Springer-Verlag.

Y. Gil and S. Miles, editors. The PROV Model Primer. http://www.w3.org/TR/prov-primer/

T. Cadenhead, V. Khadilkar, M. Kantarcioglu, and B. Thuraisingham. Transforming provenance using redaction. In Proceedings of the 16th ACM symposium on Access control models and technologies, SACMAT ’11, pages 93–102, New York, NY, USA, 2011. ACM.

Taverna Open Source and Domain Independent Workflow Management System http://www.taverna.org.uk/