Streaming Workflow Transformation


Page 1: Streaming Workflow Transformation

thesis January 27, 2011 14:03 Page i

[Cover figure: a streaming workflow diagram with the activities Input, Avg, Σ, Model, Plot, ∪, T1, T2 and T3, the views labelled configuration, avg, sum, predictions, plots and all, and the window sequences w1, w2, w3, w4 and w_e.]


Streaming Workflow Transformation

Master’s Thesis in Computer Science

Tjalling van der Wal

February

Supervisor: Andreas Wombacher


Abstract

This thesis presents (1) a formal model for streaming workflows adapted for transformation and (2) transformation rules for streaming workflows defined according to that formal model. The validity of the transformation rules is demonstrated by formally proving equivalence. The validity of the formal model is demonstrated by the fact that valid transformation rules can be defined.

Transformation of streaming workflows is the first step towards automatic optimization of streaming workflows. By providing a formal model and transformation rules, this thesis demonstrates that it is possible to build a self-optimizing Streaming Workflow System.


Contents

Abstract
Contents
List of Figures

1 Introduction
    1.1 Monitoring a River
    1.2 Sensor Data Problems and Challenges
    1.3 Streaming Workflow System
    1.4 Optimization
    1.5 Objectives
    1.6 Research Questions
    1.7 Thesis Structure

2 Formal Model
    2.1 Monitoring a River with a Streaming Workflow
    2.2 Behavior of Activities
    2.3 Assumptions on the System Runtime
    2.4 Design Choices
    2.5 Position in the System

3 Equivalence
    3.1 Monitoring a River the Same Way, but Different
    3.2 Equivalence of Activities
    3.3 Equivalence of Views
    3.4 Equivalence of Streaming Workflows
    3.5 Local and Global Equivalence

4 Transformation Rules
    4.1 Zip Rule
    4.2 Copycat Rule
    4.3 Shared Loop Rule
    4.4 Copy Elimination/Bypass Rule
    4.5 A Multi-Step Example
    4.6 Monitoring a River Slightly Differently

5 Tuple-based Transformation Rules
    5.1 Union Associativity Rule
    5.2 Selection Pushdown Rule
    5.3 Masquerade Rule
    5.4 Other Rules

6 Advanced Transformation Rules
    6.1 Aggregate Split Rule
    6.2 Shared Join Rule

7 Related Work
    7.1 Complications
    7.2 Query Graph Transformations
    7.3 Adaptive Processing
    7.4 Optimization Criteria
    7.5 Provenance
    7.6 Approximation

8 Conclusions
    8.1 Monitoring a River: Overview
    8.2 Answers to the Research Questions
    8.3 Results and Contribution
    8.4 Suggested Approach to Building an Optimizer
    8.5 Unanswered Questions
    8.6 Evaluation

Bibliography


List of Figures

Optimization
Monitoring a River to Detect Groundwater Influxes
Types of Window Sequences
Regular Window Definition
Fresh Tuples
Alignment of Window Sequences
Subalignment of Window Sequences
Position of the Formal Model
Zip Rule
Copycat Rule
Copycat Rule LHS and RHS Combined
Partitioned Timeline
Shared Loop Rule
Copy Elimination Rule
Copy Bypass Rule (Righthand Side)
Example of a Multi-step Workflow Transformation
Union Associativity Rule
Selection Pushdown Rule
Masquerade Rule
Single Input Union Rule
Idempotent Activities Rule
Collapse/Split Select Rule
Intermediate Project Rule
Aggregate Split Rule
Computational Costs in the Aggregate Split Rule
Shared Join Rule
The Ultimate Workflow


Chapter One

Introduction

This thesis presents a formal model and transformation rules for streaming workflows. In this chapter the motivation, objectives and research questions are introduced. First, this chapter introduces an example of a streaming workflow that will be used throughout this thesis.

1.1 Monitoring a River

An environmental scientist wants to devise a system to detect groundwater influxes that indicate weaknesses in dikes along a river. Part of his research involves deploying sensors to measure water temperature, flow and saline levels. The equipment used is advanced and capable of transmitting the measurements wirelessly to the research centre, sometimes as frequently as every second when required.

To detect potential natural hazards as soon as possible, a computer system at the research centre continuously processes the fresh data and compares it to configured thresholds and predicted values. To reduce false positives, the system utilizes sliding averages and historic data.

The processing done by the system has been defined and configured by means of a workflow specification. This workflow is called a streaming workflow, because it is used to process streams. In a stream of data, the individual data items are related and the order of the data has semantic significance (as opposed to business process workflows, in which the work items are similar but independent).

The system also processes a streaming workflow defined by a second scientist. By comparing each week of data with the same week in preceding years, the scientist hopes to prove his own hypothesis on the effects of climate change on the observed river. Although this workflow is less frequently executed, it is nonetheless streaming.


Both workflows independently calculate averages over the same measurements. Luckily the system is smart enough not to calculate the averages twice.

1.2 Sensor Data Problems and Challenges

The system from the example does not exist (yet). Section 1.3 proposes such a system. This section provides the motivation for building it.

Problems. The utilization of measurements is often limited to the person or team that has collected them. That is a shame because collecting measurements is expensive. The low utilization of measurements is caused by the following problems:

• Data is not stored centrally. Measurements are stored on personal desktops, where they are not available to other researchers or teams, who may not even know that the measurements exist. After a researcher leaves the project, the measurements are not accessible any more.

• Data integration is hard. Conventions differ and detail is missing. The quantity T could be ground, soil or air temperature. This makes it hard to combine and compare measurements from different projects.

• Data governance is difficult. After a researcher leaves or after a project is finished, nobody knows how measurements were collected and how they were processed. How frequently did the sensor ‘sense’? Is the data raw or has it been processed?

• The processing is manual and ad hoc. Everyone uses their own favorite tool, like Matlab, Excel, SPSS or custom scripts. This exacerbates the other problems.

Each of these problems makes it harder to collect and process sensor data in such a way that measurements can be easily shared and reused.

Challenges. Apart from the problems above there are recent developments that could provide new opportunities but could also aggravate the existing problems in the future.


• Although sensor deployments are expensive, individual sensors have become cheaper. This means more sensors can be deployed at the same cost and more data can be collected. But this also means that the amount of data to manage becomes larger.

• Another challenge is included in the example in the previous section: streaming data. Instead of arriving periodically in batches, the data arrives as a continuous stream. This means real-time monitoring is possible, provided a more robust way of managing data is developed.

1.3 Streaming Workflow System

The solution to these problems and challenges is to build a system such as the one already described in Section 1.1. The system will process incoming sensor data according to processing instructions specified by the user by means of streaming workflows: it is a Streaming Workflow System.

In ‘Composable data processing in environmental science — a process view.’ [ ] the same problems have been described and a similar system is suggested.

Scope of the System. Although the example in Section 1.1 and some of the problems in Section 1.2 are specific to environmental sciences, the ambition is to build a generic system that can be used in many more situations. Examples include administrative systems, monitoring systems and control systems, in accounting and security domains. Considering that a cash register can be seen as a sensor that observes sales, it is logical not to limit the Streaming Workflow System to environmental science applications.

Functions and Requirements for the System. A Streaming Workflow System must perform the following functions:

1. The system enforces formalization of data processing by means of streaming workflow specifications. Formalization is needed for the system to be able to perform optimization. Additionally, formalization provides governance.

2. The system shall manage data storage and distribution. If the data is centrally available, it can be shared and reused more easily.

3. The system shall optimize the workflow specifications submitted by users. Optimization shall be done both within and between workflows. Criteria for optimization include computational and storage cost. Optimization is needed to let the system scale in terms of users, data volume and data rate.

These three requirements are sufficient to tackle the problems and challenges from Section 1.2 and to make the example from Section 1.1 reality. This thesis contributes to the third requirement: optimization of streaming workflows.

[Figure 1.1: Optimization. The diagram shows the nodes Original, Alternatives and Best Alternative, connected by the steps search, search and choose, with part of the process labelled ‘this thesis’.]

1.4 Optimization

Optimization is a requirement for the Streaming Workflow System. But what is optimization? And what is needed to build an optimizer?

Optimization. Optimization can be described as searching for an alternative that is better than the original. Figure 1.1 shows the basic process. Starting from an original, an optimizer searches for alternatives. The optimizer continues by searching for more alternatives based on alternatives that have been found earlier. In the end, from all the alternatives found, the best one is chosen.

As an example, consider the algebraic formula x = (a ∗ b) + (a ∗ c). By applying algebraic rules the following alternatives can be found: x = (a ∗ c) + (a ∗ b) and x = a ∗ (b + c). The second alternative is better because it requires less computation to find x: one multiplication instead of two.

Searching for Alternatives. The search for alternatives has two prerequisites: (1) a formal way to describe both the original and the alternatives and (2) a formal definition to tell whether an alternative is valid. In the algebraic example the formula itself is a formal description and any formula that evaluates to the same value is a valid alternative.

In the workflow example from Section 1.1, workflows are described by a specification. Searching for alternatives is done by applying transformation rules to a streaming workflow specification to create alternatives, just like algebraic rules are used to create alternative formulae. The contribution of this thesis is defining transformation rules for streaming workflows.

Choosing an Alternative. Which alternative is best depends on the context. In the algebraic example, the alternative x = a ∗ (b + c) is the best when computing the value, but the original formula is the best when the products a ∗ b and a ∗ c are needed separately as well. The selection of the ‘best’ alternative streaming workflow specification is outside the scope of this thesis.
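The search-then-choose process described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple encoding of formulas, the two rules `commute` and `factor`, the restriction to rule application at the top level, and the operation-count `cost` function are all choices made for this sketch and are not taken from the thesis.

```python
# Illustrative sketch only: expressions are nested tuples (op, left, right)
# and the two rules below are assumptions made for this example.

def commute(expr):
    """Swap the operands of a binary expression (valid for + and *)."""
    if isinstance(expr, tuple):
        op, left, right = expr
        return [(op, right, left)]
    return []


def factor(expr):
    """Rewrite a*b + a*c into a*(b + c)."""
    if isinstance(expr, tuple) and expr[0] == "+":
        _, left, right = expr
        if (isinstance(left, tuple) and isinstance(right, tuple)
                and left[0] == "*" and right[0] == "*"
                and left[1] == right[1]):
            return [("*", left[1], ("+", left[2], right[2]))]
    return []


RULES = [commute, factor]


def search_alternatives(original):
    """Collect everything reachable by applying the rules at the top level."""
    found = {original}
    frontier = [original]
    while frontier:
        current = frontier.pop()
        for rule in RULES:
            for alternative in rule(current):
                if alternative not in found:
                    found.add(alternative)
                    frontier.append(alternative)
    return found


def cost(expr):
    """Count the arithmetic operations needed to evaluate an expression."""
    if not isinstance(expr, tuple):
        return 0
    return 1 + cost(expr[1]) + cost(expr[2])


original = ("+", ("*", "a", "b"), ("*", "a", "c"))
alternatives = search_alternatives(original)
best = min(alternatives, key=cost)
print(cost(original), cost(best))  # 3 2
```

The search keeps a frontier of alternatives that have not been expanded yet, which mirrors the idea of finding more alternatives from alternatives found earlier. Swapping in a different `cost` function changes which alternative is chosen, which is the context dependence mentioned above.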

1.5 Objectives

This thesis tries to contribute to the idea of a Streaming Workflow System by showing the feasibility of building such a system, in particular the feasibility of creating an optimizer for streaming workflows. The objectives of this thesis are:

1. Validate the idea of a Streaming Workflow System by showing it is theoretically possible to perform automatic optimization of user-defined streaming workflow specifications.

   It is assumed that optimization is possible if transformation is possible. Therefore this thesis is limited to streaming workflow transformation.

2. Define a formal model for streaming workflows with transformation in mind. An existing formal model for streaming workflows is used as a starting point.

3. Define transformation rules for both generic and specific activities, and for a reasonable set of common activities (including relational).

1.6 Research Questions

To fulfill the objectives, this thesis will answer the following research questions:

1. How must the (existing) formal model be adapted to allow for transformation?

2. What is a suitable definition of equivalence between workflows?

3. Can transformation rules be defined for common types of activities?

   a) Can rules be defined for generic situations?
   b) Can rules be defined for common (relational) activities?
   c) Can rules be defined for aggregates and join activities?

By answering these research questions, the second and third objectives from Section 1.5 are fulfilled. The first objective is indirectly fulfilled by fulfillment of the second and third objectives.

1.7 Thesis Structure

The structure of this thesis follows the research questions. In Chapter 2 the formal model is described and in Chapter 3 equivalence of workflows is defined. These two chapters are prerequisites for the next chapters.

In Chapters 4, 5 and 6 transformation rules valid within the formal model are defined. These chapters can be read selectively, but in particular the Copycat Rule (Section 4.2) should be read because for this rule a full proof is provided that uses and illustrates most concepts from the formal model.

Section 4.5 presents an example that shows how two independent but similar streaming workflows can share computation through the application of multiple transformation rules. Section 4.6 shows how transformation rules can be used to optimize the ‘Monitoring a River’ example as the workflow evolves during a research project.

In Chapter 7 related work with regard to streaming workflows and continuous querying is discussed. In Chapter 8 the results of this research are evaluated against the objectives and in Section 8.4 a suggested approach for actually building an optimizer for a Streaming Workflow System is presented.


Chapter Two

Formal Model

Before a streaming workflow can be transformed, a formal description of the workflow is required. The description must be formal because (1) the description must be manipulable by a computer system and (2) it must be possible to prove the equivalence between workflows. This chapter introduces the formal model used to describe streaming workflows.

In the first section the ‘Monitoring a River’ case from the introduction is used as an example to introduce the core concepts of the formal model. In Section 2.2 the behavior of a streaming workflow is formalized. In Section 2.3 assumptions on the system’s runtime are listed. These are needed to make the model work and to complete the proofs of transformation rules. Section 2.4 discusses design choices that have influenced the formal model.

Origins. The formal model presented in this chapter is based on the formal model described by Wombacher in ‘Data Workflow — A Unified Time Controlled E-Science Processing Model’ [ ]. Compared to the original model, this new model trades expressive power for the predictability needed to do transformations.

It is important to note that although the formal model is less expressive, this does not mean that a Streaming Workflow System would be less powerful. More powerful constructs can be offered by the system, with the limitation that not all of those can be transformed and optimized using the formal model presented here.

2.1 Monitoring a River with a Streaming Workflow

Section 1.1 introduced the example of a scientist who tries to protect dikes by detecting groundwater influxes. The data processing needed to detect influxes has been specified by the scientist by means of a streaming workflow. This section explains how a streaming workflow works using this example.

[Figure 2.1: Monitoring a River to Detect Groundwater Influxes. The diagram shows the activities Temperature Sensor 1 to n, Flow Sensor 1 to n, a filter, Avg (twice), Sum, Influx Detector, Data Entry and Subtract, together with the views labelled T, m³/minute, m³/hour, known influx locations, influxes and alarms. The activities are grouped under the labels ‘sensors’ and ‘filtering and aggregation’.]

Activity | Output (content of view) | Purpose
Temperature Sensor | T in °C at location x at time t | Observe river
Flow Sensor | Flow in m³/s at location x in time interval t | Observe river
Filter | T in °C at location x at time t | Eliminate values outside the percentile range to address sensor errors
Avg(T) | Average T in °C at time t over last hour | Calculate average temperature in river
Avg(Flow) | Average flow in m³/s in time interval t | Use average as ‘weighted majority vote’ to reduce impact of sensor error
Sum(Flow) | Total flow in m³/hour at time t | Calculate total flow in last hour
Influx Detector | Detected influxes at location x during time interval t | Does what this workflow was created for; all the other activities exist to feed data to this activity
Data Entry | List of known influx locations | Manually record locations with known issues or special circumstances
Subtract | List of alarms without false positives | Eliminate known influx locations from list of influxes

Table 2.1: Description of Workflow Components in Figure 2.1

To detect influxes, the scientist has written an algorithm that looks for anomalies in the combination of water temperature and water volume. Before sensor data can be fed to this algorithm, the data has to be filtered to address sensor errors and has to be aggregated. After the influx detection, known locations have to be filtered out to produce a list of locations on the dike that need inspection. So the workflow has the following steps: (1) record the data, (2) filter the data, (3) aggregate the data, (4) execute the detection algorithm, (5) record known influx locations and (6) eliminate false alarms.

Figure 2.1 shows the streaming workflow created by the scientist. Table 2.1 details the purpose of each individual component in the workflow. Together these components perform the six required steps.

The workflow starts with the sensors deployed in the river; periodically they send data to the research centre. In the formal model everything that produces or processes data is an activity. The activities representing the sensors are shown on the left of the workflow as blue boxes. Associated with each activity is a view that stores all data produced by that activity. In the workflow, views are depicted as small green circles.

The recording of sensor data is represented by the sensor activities and their corresponding views. The second step to perform is to filter the data to reduce the impact of sensor error. For temperature data, a filter activity periodically reads the data from all the sensor views and produces a new stream from which outliers have been removed. For the flow data, the average over all sensors is calculated by an Avg activity. The output from both activities is again stored in views.

The third step, aggregation, is similar to the filtering: an activity reads the data from a view, applies the required computation and produces a stream which is stored in a new view. This results in two views: one with a sliding average temperature and one with a sliding total flow volume.

The custom influx detection algorithm is the fourth step and is also an activity. It reads the aggregated data and produces a stream of locations with an anomaly in the combination of water temperature and water volume.


As a sixth and final step the list of locations is filtered by means of a manually maintained list that is produced by step five.

Two phrases in the description above have been deliberately left vague: ‘periodically’ and ‘reads the data’. This raises two questions: when does an activity execute, and which data is used for an execution? The next section answers those questions.

2.2 Behavior of Activities

This section describes the behavior of activities. Activities are the active components of a streaming workflow, as opposed to views, which just store data and thus have no behavior. The behavior of activities is defined by two components: (1) when to execute and (2) what input data to use. Both are specified using window sequences.

Definition: Window Sequence. A window sequence is a regular and predictable sequence of windows on a view. Each window represents a single activation of an activity and the selected data from the view to use for that activation. A window sequence is either defined in terms of time or in terms of arriving tuples.

When an activity must be activated (or executed) is determined by an arithmetic sequence of timestamps or tuple-ids. Elements of such a sequence are called reference points.

Definition: Reference Points. The reference points of an activity are an arithmetic sequence generated by [epoch + n ∗ delta | n ∈ ℕ₀]. For time-based activities epoch and delta must both be valid time units. For tuple-based activities both must be tuple counts.

An activity can have a different window sequence for each input relation, but they all share the same reference points.

For each reference point the workflow system activates the activity. The system feeds the activity with data selected by the current window and the activity appends zero or more output tuples to the output view.
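The reference point sequence can be illustrated with a small generator. This is a minimal sketch: the function name `reference_points` and the example values are assumptions made here, and tuple counts are represented as plain integers.

```python
from itertools import count, islice


def reference_points(epoch, delta):
    """Yield epoch + n * delta for n = 0, 1, 2, ... without end."""
    for n in count():
        yield epoch + n * delta


# Tuple-based example: first activation at tuple 10, then every 5 tuples.
first_four = list(islice(reference_points(epoch=10, delta=5), 4))
print(first_four)  # [10, 15, 20, 25]
```

Because the generator only uses addition and multiplication by an integer, the same sketch should also accept a `datetime` epoch with a `timedelta` delta for time-based activities.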


[Figure 2.2: Types of Window Sequences, with one panel each for fixed, landmark, partitioning, sliding and hopping windows on a view. The horizontal axis represents the data in the view and the bars are window instances, indicating the selected data. The vertical axis displays the n used to generate the reference point sequence [epoch + n ∗ delta | n ∈ ℕ₀].]


What data must be selected from input views is defined relative to the reference point of the current activation. Three basic types of window sequences are supported.

Definition: Regular Window Sequence. For regular windows, the lower and upper bound of the data interval selected from a view are defined relative to the reference point. The interval definition is: ⟨reference point − (ws + offset) . . . reference point − offset]. Here ws is the window size and offset specifies the distance of the window to the reference point. Figure 2.3 visualizes this.

There are exactly three subtypes of regular window sequences, based on the relation between the window size (ws) and the distance that the reference point shifts between each window (delta). These three subtypes are partitioning (ws = delta), sliding (ws > delta) and hopping (ws < delta). Figure 2.2 shows them.

A partitioning window sequence has two specific properties: all windows are disjoint, ∀i, j with i ≠ j : wi ∩ wj = ∅ (mutually exclusive), and all windows together cover the entire view, ⋃ wi = V (collectively exhaustive). A hopping window sequence ignores part of the data in the input view.
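The regular window definition and its three subtypes can be made concrete with a short sketch. The helper names `regular_window`, `subtype` and `select` are assumptions made for this illustration, and tuples are represented simply by their integer position in the view.

```python
def regular_window(reference_point, ws, offset=0):
    """Return (lower, upper); lower is exclusive, upper is inclusive."""
    upper = reference_point - offset
    lower = reference_point - (ws + offset)
    return lower, upper


def subtype(ws, delta):
    """Classify a regular window sequence by comparing ws with delta."""
    if ws == delta:
        return "partitioning"
    return "sliding" if ws > delta else "hopping"


def select(view, window):
    """Pick the tuples whose position lies in the half-open interval."""
    lower, upper = window
    return [t for t in view if lower < t <= upper]


view = list(range(1, 13))            # tuples identified by position 1..12
epoch, delta, ws = 4, 4, 4           # ws == delta, so partitioning
windows = [regular_window(epoch + n * delta, ws) for n in range(3)]
selected = [select(view, w) for w in windows]

print(subtype(ws, delta))  # partitioning
print(windows)             # [(0, 4), (4, 8), (8, 12)]
print(selected[1])         # [5, 6, 7, 8]
```

In this partitioning example every tuple of the view is selected by exactly one window, which matches the two properties stated above. With `ws=6, delta=2` the windows would overlap (sliding), and with `ws=2, delta=5` some tuples would never be selected (hopping).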

[Figure 2.3: Regular Window Definition. The diagram marks the window size ws, the offset, the lower bound, the upper bound and the reference point.]

Definition: Landmark Window Sequence. For landmark windows the lower bound is fixed and the upper bound is defined in the same way as for regular windows. The interval definition is: ⟨landmark . . . reference point − offset].

Definition: Fixed Windows. For fixed windows the interval is defined with fixed lower and upper bounds. Fixed windows are used for one-time queries and to incorporate historic data.


Definition: Window (∼ Instance). A single window from a window sequence, obtained by substitution of the reference point of the current activation into the window sequence definition. It serves as the data selection interval for the current activation. The lower bound of a window is exclusive and the upper bound is inclusive.
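Obtaining a window instance by substituting a reference point can be sketched for the landmark and fixed cases as follows. The class names `LandmarkWindows` and `FixedWindows` and the method name `instance` are hypothetical, and bounds are plain integers for simplicity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LandmarkWindows:
    """Fixed lower bound; the upper bound follows the reference point."""
    landmark: int
    offset: int = 0

    def instance(self, reference_point):
        return self.landmark, reference_point - self.offset


@dataclass(frozen=True)
class FixedWindows:
    """Both bounds are fixed; the reference point does not move them."""
    lower: int
    upper: int

    def instance(self, reference_point):
        return self.lower, self.upper


growing = LandmarkWindows(landmark=0)
historic = FixedWindows(lower=100, upper=200)

print([growing.instance(rp) for rp in (5, 10, 15)])   # [(0, 5), (0, 10), (0, 15)]
print(historic.instance(5) == historic.instance(15))  # True
```

Each returned pair is read as an interval with an exclusive lower bound and an inclusive upper bound, as in the definition above.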

Definition: Standard Tuple-based Window Sequence. For tuple-based activities a special window sequence is defined. The standard tuple-based window sequence w_e for event-based activities is defined as delta = 1, ws = 1, offset = 0.

When activated, an activity is not just called with a bag of input tuples. The workflow system also reports the interval used and all the reference point sequence and window sequence information. Union and Join activities need this information to distinguish between fresh tuples and tuples they ‘have seen before’.

Knowing the specification of the window of the current activation, an activity can calculate the bounds of the previous window. The latter can be used to determine which tuples ‘have been seen before’ and which tuples are fresh.

Definition: Fresh Tuples. A fresh tuple is a tuple that was not included in an earlier window instance. Figure 2.4 shows windows from a sliding window sequence with fresh tuples marked orange. In partitioning and hopping windows, all tuples are fresh.
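For regular window sequences, the fresh part of a window can be computed from the bounds of the previous window, as the text describes. The sketch below is one way to do this under the assumption that positions are integers and delta is positive; the function name `fresh_interval` is hypothetical.

```python
def fresh_interval(reference_point, epoch, delta, ws, offset=0):
    """Return (lower, upper] of the tuples not seen by an earlier window."""
    upper = reference_point - offset
    lower = upper - ws
    if reference_point == epoch:
        # First activation: there is no earlier window, so all is fresh.
        return lower, upper
    # The previous window ended exactly delta before the current one.
    previous_upper = upper - delta
    return max(lower, previous_upper), upper


print(fresh_interval(12, epoch=10, delta=2, ws=6))  # (10, 12)
print(fresh_interval(8, epoch=4, delta=4, ws=4))    # (4, 8)
print(fresh_interval(15, epoch=10, delta=5, ws=2))  # (13, 15)
```

The first call is a sliding window, where only the last delta positions are fresh. The second and third calls are partitioning and hopping windows, where the fresh interval equals the whole window.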

[Figure 2.4: Fresh Tuples in a Sliding Window Sequence. Fresh tuples shown in orange.]


[Figure 2.5: Alignment of Window Sequences, showing two window sequences w1 and w2.]

[Figure 2.6: Subalignment of Window Sequences, showing two window sequences w1 and w2.]

An important property of reference points and window sequences is alignment. This property is used in proofs later in this thesis.

Definition: Window Sequence Alignment. Two window sequences w1 and w2 align iff delta(w1) = delta(w2) and ∃n ∈ ℤ : epoch(w2) = epoch(w1) + n ∗ delta(w1).

Because n is allowed to be negative, alignment is an equivalence relation: it is reflexive, symmetric and transitive. Reference to alignment of activities is shorthand for alignment of their respective window sequences.

Definition: Window Sequence Subalignment. Two window sequences w1 and w2 subalign iff ∃k ∈ ℕ₁ : k ∗ delta(w1) = delta(w2) and ∃n ∈ ℕ : epoch(w2) = epoch(w1) + n ∗ delta(w1).

Window sequence subalignment captures the case where one window sequence is a subsequence of another window sequence. For example, an average over a period of a week that is calculated each week is a subsequence of an average over a seven-day period that is calculated each day.
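Both definitions reduce to divisibility checks when epoch and delta are integers. The following sketch assumes positive integer deltas (for example, a number of days) and reads ℕ as including zero; the function names `aligns` and `subaligns` are chosen for this illustration.

```python
def aligns(epoch1, delta1, epoch2, delta2):
    """Same delta, and the epochs differ by a whole number of deltas."""
    return delta1 == delta2 and (epoch2 - epoch1) % delta1 == 0


def subaligns(epoch1, delta1, epoch2, delta2):
    """w2 takes every k-th reference point of w1, starting at or after epoch1."""
    k_exists = delta2 >= delta1 and delta2 % delta1 == 0
    shift = epoch2 - epoch1
    n_exists = shift >= 0 and shift % delta1 == 0
    return k_exists and n_exists


# Daily (w1) versus weekly (w2) reference points, measured in days.
print(subaligns(0, 1, 7, 7))                     # True
# Alignment is symmetric because the shift may be negative.
print(aligns(3, 4, 11, 4), aligns(11, 4, 3, 4))  # True True
print(aligns(0, 4, 2, 4))                        # False
```

The symmetric result in the second call relies on Python's `%` returning zero for negative multiples of the divisor.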


2.3 Assumptions on the System Runtime

To make the formal model function, several assumptions must be made on the Streaming Workflow System runtime.

Assumption I: The System is Infinitely Fast. The model has a formal and mathematical nature. Execution time is not something that concerns the model. Therefore the model assumes that execution of activities is fast; so fast that execution takes zero time. This assumption has a very nice effect on the notion of equivalence: since execution is infinitely fast, intermediate states of the system will not be observable and will thus not affect equivalence.

Assumption II: The System Ensures Correct Execution Order. When more than one activity is enabled for the same reference point, the runtime of the Streaming Workflow System will ensure that an execution order is chosen that respects the input-output dependencies between the activities. When an activity is executed it is guaranteed that all input views are complete for all reference points less than or equal to the current reference point.

The system may not be able to fulfill this assumption when the workflow specification contains cycles. This thesis ignores cyclic subgraphs. Cyclic subgraphs must be treated as a single non-transformable activity.
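One way a runtime could satisfy Assumption II is a topological sort of the activity dependencies, which also shows why cycles are a problem. This is a sketch using Python's standard `graphlib` module (Python 3.9 or later), not the mechanism of the thesis; the activity names loosely follow the river example and the dictionary `reads_from` is an assumption that abstracts away the views between activities.

```python
from graphlib import CycleError, TopologicalSorter

# Maps each activity to the activities whose output views it reads.
reads_from = {
    "Filter": {"TemperatureSensor"},
    "AvgTemperature": {"Filter"},
    "AvgFlow": {"FlowSensor"},
    "SumFlow": {"AvgFlow"},
    "InfluxDetector": {"AvgTemperature", "SumFlow"},
    "Subtract": {"InfluxDetector", "DataEntry"},
}


def execution_order(dependencies):
    """Return one order in which every activity runs after its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


order = execution_order(reads_from)
print(order.index("Filter") < order.index("AvgTemperature"))  # True

# A cyclic specification has no valid order.
try:
    execution_order({"A": {"B"}, "B": {"A"}})
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)  # True
```

Several valid orders usually exist; the sketch only guarantees that each activity appears after all activities it reads from.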

Assumption III: The Initial System is Empty. The workflow model does not allow orphaned views. As a result any data in the system must have been added to the system through an activity. This includes historic data and (configuration) data that is added manually. This assumption is needed to prove the correctness of transformations.

2.4 Design Choices

This section details a number of choices made while designing the formal model. Unlike the assumptions from the previous section, these choices are not needed to make the formal model function. They have, however, had their impact on the formal model.


Design Choice I: Streaming and Querying are the Same. Wombacher [ ] makes a distinction between a streaming mode of operation and a special query mode. This model chooses to treat them the same. To illustrate this in a more technical manner: suppose the system is like a functional programming language and the sensor data is stored in lists. To process the data, you need mapping and folding functions. But these functions can be applied to both finite and infinite lists and also to lists that are still being generated. For the functions there is no difference between stream processing fresh data and batch processing historic data. Likewise the streaming workflow model does not distinguish streaming and querying modes.
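The functional-programming analogy can be made concrete with a small sketch. Python generators stand in for lazy lists here; the function `block_sums` and the sample data are hypothetical and only serve to show that one and the same fold can consume a stored list and an unbounded stream.

```python
from itertools import count, islice

def block_sums(values, size):
    """Yield the sum of each consecutive block of `size` values (one fold per window)."""
    it = iter(values)
    while True:
        block = list(islice(it, size))
        if len(block) < size:
            return  # the input ended before the block was complete
        yield sum(block)

historic = [1, 2, 3, 4, 5, 6]   # finite data that is already stored
live = count(1)                 # infinite data that is still being generated

batch_result = list(block_sums(historic, 3))
stream_result = list(islice(block_sums(live, 3), 2))
```

The function itself cannot tell whether it is processing historic or fresh data; only the caller decides how much of the result to consume.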

The reason for this choice is that it allows the formal model to be agnostic of the current time; epoch and other time variables are not limited to Now and the future but can also be in the past.

It is my personal belief that this makes the model cleaner and less complex but does not needlessly complicate the implementation of the system runtime.

Design Choice II: Triggers are Ugly. In [ ], triggers are used to determine when activities should be executed. Triggers are not declarative and their semantics are hard to capture, especially the fact that ‘ week’ does not mean the same as ‘ days’. Also, a trigger alone cannot be manipulated by transformations because it does not describe what data an activity needs for an activation. Therefore in this model triggers have been replaced by window sequences.

Design Choice III: Workflows are Immutable. Once submitted to the system by the user, a workflow is immutable. When a workflow needs to be updated, the user must submit a new workflow to the system. Whenever it is said that a workflow is updated, it means that a new workflow derived from the original is added to the system.

The rationale for immutable workflows is data governance: because workflows cannot change, it is always possible to tell how a certain tuple was generated.

The implication of immutable workflows is that many workflows exist in the system that are only slightly different from each other, often in trivial ways. Therefore there is much more opportunity for shared computation than is apparent.


Design Choice IV: All Activities are Stream Deterministic. Wombacher [ ] distinguishes deterministic and non-deterministic activities. In this thesis, this distinction is ignored and all activities are considered to be stream deterministic: given the same data stream, a just initialized instance of an activity always produces the same output stream. Whether an activity carries internal state across subsequent activations is irrelevant for this type of determinism.

The kind of determinism described in [ ] is much stronger as it is defined in terms of individual activations, theoretically allowing out-of-order execution. This sort of optimization is outside the scope of this thesis.

Stream determinism does require that the output of activities solely depends on the current and previous inputs. The results should not depend on (in particular) the actual execution time. This gives the system the flexibility to delay execution, but also means that transformations can cause earlier execution of activities without breaking equivalence.

Sensors and data entry activities in a workflow are not stream deterministic. This is not a problem because these activities are the leaves of the workflow graph and for that reason cannot be eliminated anyway.

. Position in the System

This section describes the position of the formal model within a Streaming Workflow System. Describing what the model is and, in particular, what it is not, has helped me considerably to think about transformation rules in isolation.

The Model is Not a User Interface. The formal model is not intended as a user interface. A real system will provide a higher level of abstraction. Because the model is not a user interface, seemingly arbitrary and counter-intuitive limitations and assumptions can be made about workflows.

The Model is Not an Execution Plan. The Streaming Workflow System does not actually execute a streaming workflow according to the specification. Different algorithms and data structures may be used depending on activity configuration, expected or observed data rates and workflow structure. Also, common subgraphs may be implemented by a single algorithm. Because the model is not an execution plan, it is no problem when a transformation rule produces a workflow that seems less efficient.

[Figure: two parallel stacks of layers. Streaming Workflow System: User Interface, Formal Model, Implementation. Relational Database: SQL, Relational Algebra, Physical Operators.]

Figure . : Position of the Formal Model in a Streaming Workflow System.

So What is the Model? The formal model is an abstraction layer that enables the system to transform streaming workflow specifications into any shape. The desirability of a shape is no concern in the model.

Figure . compares the Streaming Workflow System with a relational database management system (RDBMS): the model sits between the user interface and the actual implementation of activities, similar to the function of relational algebra.


Chapter Three

Equivalence

In Section . , optimization was summarized as ‘searching for an alternative that is better than the original’. One of the questions that arises from that description is: what defines a valid alternative? When is a streaming workflow a valid alternative for another streaming workflow?

A workflow is a valid alternative iff it can be interchanged with the original without changing the output of the system. The formal concept for ‘valid alternative’ or ‘interchangeable’ is equivalence. Two workflows are valid alternatives for each other when they are equivalent.

Equivalence is denoted by the ≡ operator. In this thesis the notations =_t and ≡_t are used to state that an equality or equivalence is valid at a certain point in time t.

. Monitoring a River the Same Way, but Different

Previous chapters have used the example of a scientist who wants to observe dikes along a river to detect groundwater influxes and his colleague who studies the effects of climate change on the river. What does equivalence mean to them?

To the user of a Streaming Workflow System equivalence means that he gets the results he expected; that the data apparently has been processed in the way he has specified the data to be processed. The system must deliver predictable, reproducible and traceable results to guarantee the scientific validity of his research.

Because the user has such specific expectations and noble intentions with the results, the system cannot compromise the quality of the results in any way. The only thing the system is allowed to do is replace a workflow by another workflow that is equivalent, but possibly more efficient.


Equivalence simply means that two different workflows deliver the same results. This chapter formalizes this intuitive definition to make it possible to prove that transformation rules preserve equivalence in Chapter .

. Equivalence of Activities

This chapter describes equivalence bottom-up, so this section starts with describing equivalence of activities. Intuitively, activities are equivalent when they produce identical output from the same input.

Definition : Equivalent Activities. Two activities are defined to be equivalent iff (1) they run the same code, (2) this code is stream deterministic and (3) they have the same configuration. According to Design Choice IV, the second condition is always true.

Note that sensors and other activities at the leaves of the workflow are never equivalent. One could either argue that they do not run code in the sense intended in the definition of equivalent activities or that the location of their deployment is part of their code. Either way, the first condition is false.

In transformation rules, activities are considered equivalent iff they have the same label. If the labels are different, then the activities are not equivalent. This does not prevent transformation rules from stating that two activities are interchangeable because workflow equivalence will be defined in terms of view equivalence.

. Equivalence of Views

With equivalence defined for activities, the equivalence of views can be defined.

Definition : Equivalent Views. Two views are defined to be equivalent iff at all points in time their content is identical. Content includes all the attributes in the user defined schema plus a subset of the metadata managed by the Streaming Workflow System.


The content of a view for the purpose of equivalence consists of all records (including empty records), with all the attributes in the user defined schema. In addition the idealized transaction time attribute (part of the system’s metadata) is included; this is the timestamp used by interval predicates to select data from views.

Adding tuple id to the content would make equivalence much harder, but not adding it restricts the kinds of tuple based triggers that can be transformed. This choice requires further consideration.

Equivalence of views can be very simple. The following lemma formalizes a trivial case of equivalence between views. It is a special case of the definition above.

Lemma : Trivial Equivalence. If two views are (1) produced by equivalent activities, (2) from equivalent input views and (3) with the same window sequence, then they are equivalent according to Definition .

The first and second conditions must be (indirectly) true for sharing to be possible at all: when doing something different with the data, nothing can be shared and likewise when the data is not from a common source. A lot of transformation rules target situations where the third condition is not true, but where there is some relation between the window sequences that causes a subsumption relation between views.

A simple proof of Lemma can be given by induction over time. Given two views for which all three conditions hold: VA, which holds the output of activity A, and VB, which holds the output of activity B. At t = 0 both views are empty according to Assumption III, so their content is the same. Assume that at time t the content of both views is still the same. At time t + 1 both A and B are activated and (1) read the same data interval from equivalent views, (2) perform the same operation on the same data and (3) produce the same output tuples, which are appended to VA and VB. So at time t + 1 the contents of VA and VB are still the same. Thus VA and VB are equivalent by Definition . □

The first rule discussed in Chapter is proved by showing trivial equivalence. The second rule shows a proof for a situation where the third condition of Lemma is not met.


Strict Equivalence Weak Equivalence

. Zip

. Copycat. Shared Loop

. Copy Bypass . Copy Elimination

. Union Associativity

. Selection Pushdown

. Masquerade. Single Input Union. Idempotent Activities

. Collapse and Split Select

. Intermediate Project

. Aggregate Split. Shared Join

Table . : Transformation Rules by Equivalence Type


. Equivalence of Streaming Workflows

Based on the definition of equivalent views, the definition of equivalent workflows can be given.

Definition : Equivalent Workflows (Strict). Two streaming workflows are equivalent iff for each view in either workflow, an equivalent view exists in the other.

Many transformation rules introduce views to the workflow to store intermediate results that can be shared and reused. As a result these rules are not valid under the strict definition of workflow equivalence. From a practical (user) viewpoint however additional views are no issue. Therefore a weaker definition of workflow equivalence is added.

Definition : Equivalent Workflows (Weak). Given a subset of the views, two streaming workflows are equivalent iff for each view in the set, equivalent views exist in both workflows.

This definition weakens the equivalence by restricting the number of views that must have an equivalent. In practice it will be no problem that the set of equivalent views is restricted because there is an asymmetry as described next.

Definitions and are symmetrical because an equivalence relation has to be symmetrical. In reality there is asymmetry: one workflow will be the specification written by the user and the second workflow is the internal workflow used by the system runtime. The subset used for equivalence is the set of views in the user specification. Additionally, many users are only interested in the end results produced by their workflow. Intermediate views can therefore also be removed from the subset used for equivalence.
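The difference between the strict and the weak definition can be sketched in a few lines. This is an illustration under stated assumptions only: a view is modelled as a list of (transaction time, attributes) records, a workflow as a dictionary of named views, and time as integers up to a finite `horizon`; none of these representations or function names come from the formal model.

```python
def content_at(view, t):
    """Records of a view visible at time t; a record is (transaction_time, attributes)."""
    return sorted(rec for rec in view if rec[0] <= t)

def views_equivalent(a, b, horizon):
    """Identical content at every point in time up to `horizon`."""
    return all(content_at(a, t) == content_at(b, t) for t in range(horizon + 1))

def has_equivalent(view, workflow, horizon):
    return any(views_equivalent(view, other, horizon) for other in workflow.values())

def strictly_equivalent(wf_a, wf_b, horizon):
    """Each view in either workflow has an equivalent view in the other."""
    return (all(has_equivalent(v, wf_b, horizon) for v in wf_a.values())
            and all(has_equivalent(v, wf_a, horizon) for v in wf_b.values()))

def weakly_equivalent(wf_a, wf_b, subset, horizon):
    """Each view in `subset` has an equivalent view in both workflows."""
    return all(has_equivalent(v, wf_a, horizon) and has_equivalent(v, wf_b, horizon)
               for v in subset)

user = {"Vavg": [(1, (3.5,)), (2, (4.0,))]}
internal = {"Vavg": [(1, (3.5,)), (2, (4.0,))],
            "Vsum": [(1, (7.0,)), (2, (8.0,))]}   # extra intermediate view

strict = strictly_equivalent(user, internal, horizon=2)
weak = weakly_equivalent(user, internal, list(user.values()), horizon=2)
```

Here the internal workflow adds an intermediate view, so the strict check fails while the weak check over the views of the user specification succeeds.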

Finally, any transformation rule can be rewritten to preserve strict equivalence. This is shown by the Copy Elimination/Bypass Rule in Section . . By performing only the additions specified by the rule and not the deletions, a workflow is created with a maximum set of equivalent views. It is just a little odd to call this a ‘transformation’. By extension this means that a Streaming Workflow System can always recreate views when the subset of views used for equivalence (i.e. user interests) changes.

Comparing Strict and Weak Equivalence. Table . lists the transformation rules presented in the next chapters classified by the type of equivalence. Rules with the same type of equivalence also tend to share other characteristics.

An interesting observation is that rules that adhere to strict equivalence identify duplicate views in a workflow or identify a subsumption relation between views. Application of these rules is always beneficial from a computational or storage perspective. Rules that only preserve weak equivalence try to make duplicate work explicit, allowing the results to be shared and the duplication to be eliminated. Additionally these rules can be used to trade computational cost for storage cost and vice versa.

Independence of Equivalence. The definition of equivalence given is independent of both the workflow structure and the execution model. This has the following advantages:

• Transformation rules can structurally change the workflows, as long as the required views are still present.

• The formal model can be changed and implementations can diverge from it. A realistic example would be a filtering activity or a join activity that produces two output views based on two independent predicates.

• Part of a workflow definition can use other semantics and a different execution model. The definition of equivalence allows the formal model to be treated as a subset of the original model defined by [ ].

. Local and Global Equivalence

Streaming workflows are composed from multiple activities, each of which has its own semantics. It would be practical if transformation rules could be defined just for basic situations, focussing on a single semantic aspect of a specific type of activity. This is possible because global equivalence is preserved by preserving local equivalence.


Lemma : Local Equivalence is Global Equivalence. When a subgraph of a workflow is transformed, it is sufficient to show that equivalence is preserved in that subgraph to prove that equivalence is preserved for the full workflow graph. As a consequence, each transformation rule can be discussed in isolation.

To prove Lemma , consider transformation T, the set W of all views in a workflow and a subset S ⊆ W of views affected by T. Equivalence is defined under the set of views E ⊆ W (for strict equivalence E = W).

For all views E ∩ S it is guaranteed that equivalence exists on both sides of the transformation, because T is a valid transformation rule. The views E ∩ (W \ S) are not affected by the transformation rule, so these do not change equivalence. Since E = (E ∩ S) ∪ (E ∩ (W \ S)), equivalence under E of the complete workflow is preserved. □
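The set identity used in the last step of this proof can be checked on a concrete instance. The view names below are made up for illustration; the sketch only shows that the views in E split cleanly into a part touched by T and a part left alone.

```python
W = {"V1", "V2", "V3", "V4", "V5"}   # all views in the workflow (hypothetical names)
S = {"V2", "V3"}                      # views affected by transformation T
E = {"V1", "V3", "V5"}                # views under which equivalence is defined

touched = E & S            # equivalence guaranteed because T is a valid rule
untouched = E & (W - S)    # not affected by T, so equivalence is unchanged

covers_E = (touched | untouched) == E
disjoint = not (touched & untouched)
```

Because the two parts are disjoint and together cover E, every view in E falls under exactly one of the two arguments of the proof.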

With this lemma it is possible to transform a subgraph of a workflow, without worrying about the effect of the transformation on the whole workflow. It also means that the entire workflow does not have to use the semantics defined in the formal model.


Chapter Four

Transformation Rules

In this chapter a series of generic transformation rules is presented. Each of these rules is valid within the formal model from Chapter . The rules demonstrate that the formal model can be used to define transformation rules for generic situations (Research Question a).

Each rule in this chapter is informally described using graphs and text. The Copycat Rule (Section . ) contains a full formal proof; for the other rules an indication is given of how a similar proof could be constructed.

Section . starts with a simple rule. In Section . a workflow is transformed using all the rules presented in this chapter and in Section . the river example is used to illustrate how the Streaming Workflow System can optimize an evolving workflow.

Describing Transformation Rules. Transformation rules are described visually by two graphs that represent the before (left hand side, or lhs) and after (right hand side, or rhs) state. In addition preconditions and introduced window sequences are stated. Quantification of the graph structure and preconditions is handled either intuitively or by choosing a specific case; further research or an implementation will have to formalize this to make the rules generic.

Applicability of Rules. A transformation rule is applicable when (1) a subgraph match is found, (2) all preconditions are true, (3) all variables referenced by the rule are defined and (4) in each precondition and variable definition the lhs and rhs are of the same type.

. Zip Rule

This section starts with a very simple rule to reiterate the concepts from the formal model. Do not be deceived by its simplicity, as it is the most fundamental rule of all. Many other rules increase the complexity of the workflow graph, trying to create possibilities for the Zip Rule to be applied.

[Figure: Before transformation (lhs): Vin is read through window sequence w by two activities labelled Act, which produce Vout and Vout′. After transformation (rhs): Vin is read through w by a single activity Act, which produces Vout.]

Figure . : Zip Rule

The Zip Rule is the embodiment of the notion of trivially equivalent views from Lemma . To prove the validity of the rule, start by proving that Vout and Vout′ are trivially equivalent. Views must meet three conditions to be trivially equivalent:

1. Produced by equivalent activities. Check; expressed by the fact that all activities in Figure . have the same label.

2. From equivalent input views. Check; the input views of all activities in Figure . are identical, which is stronger than equivalent.

3. The same window sequence. Check; all input relations in Figure . have the same label.

For each view on the left hand side (lhs), an equivalent view exists on the right hand side (rhs), because Vout and Vout′ are trivially equivalent. Therefore the workflow before application of the rule is strictly equivalent to the workflow after application of the rule [Definition ]. □
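As an illustration of how the three conditions could drive an implementation, the sketch below merges activities that agree on label, input views and window sequence. The `Activity` record and the `zip_rule` function are hypothetical; the thesis does not prescribe a data structure for workflow graphs.

```python
from typing import NamedTuple

class Activity(NamedTuple):
    label: str      # equal labels stand for equivalent activities
    inputs: tuple   # names of the input views
    window: str     # label of the window sequence on the input relations
    output: str     # name of the output view

def zip_rule(workflow):
    """Merge activities that satisfy the three conditions of trivial equivalence.

    Returns the reduced workflow and a map from each removed view to the
    equivalent view that replaces it.
    """
    kept = {}
    replaced = {}
    for act in workflow:
        key = (act.label, act.inputs, act.window)
        if key in kept:
            replaced[act.output] = kept[key].output
        else:
            kept[key] = act
    return list(kept.values()), replaced

lhs = [Activity("Act", ("Vin",), "w", "Vout"),
       Activity("Act", ("Vin",), "w", "Vout'")]
rhs, replaced = zip_rule(lhs)
```

The sketch performs a single pass and does not rewrite consumers of the removed view; a fuller version would redirect those inputs using the `replaced` map and repeat until nothing changes.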

. Copycat Rule

The Copycat Rule is very similar to the Zip Rule. Only one condition from the trivial equivalence of views is not fulfilled in the following situation. This section shows a non-trivial proof for a rule and illustrates scenarios that a Streaming Workflow System is expected to optimize.

Scenario. Although copying someone else’s work may be a crime in an academic context, being ‘inspired’ by the way a colleague deals with certain tasks is absolutely acceptable for day-to-day work. Reinventing existing solutions is often not the best way to spend time.

The Copycat Rule enables workflow optimizations in the following scenarios. For the second and third scenario it is important to remember that from the system’s point of view workflows never change (see Design Choice III).


[Figure: Before transformation (lhs): V1 is read through w1 by Act1, which produces V2, and through w2 by Act2, which produces V3. After transformation (rhs): V1 is read through w1 by Act1, which produces V2; V2 is read through w3 by Copy, which produces V3.]

Preconditions:
Act1 and Act2 are equivalent
w1 and w2 are both time-based
epoch(w1) < epoch(w2)
delta(w1) = delta(w2)
offset(w1) = offset(w2)
ws(w1) = ws(w2)
∃n > 0 : epoch(w2) = epoch(w1) + n ∗ delta(w1)

Definition of w3:
epoch(w3) = epoch(w2)
delta(w3) = delta(w2)
offset(w3) = 0
ws(w3) = delta(w2)

Figure . : Copycat Rule


• A user sees a workflow from a colleague and thinks that it is really a clever idea. So the user creates an (almost) identical workflow for himself. This is a literal copycat situation!

• Some task is reassigned. Not familiar with the task, the new assignee starts off by doing things the way the former assignee did. When he understands things better, he will make his own modifications.

• Some business practice is repeated regularly (e.g. every fiscal period). Each time the procedure is slightly tweaked, resulting in a new workflow being added to the system.

All three scenarios share one observation: the main difference between two workflow specifications is when they were specified. In terms of window sequences all variables are the same except for the epoch variable.

The Rule. The Before and After situations in Figure . show the structural changes to the workflow. The preconditions formalize the observations mentioned before: everything is the same, except the epoch.

The preconditions contain one special requirement: the epochs must align. As a consequence of this alignment and the other preconditions, the windows defined by w2 are a proper subset of the windows defined by w1. This means that V3 is a proper subset of V2. Since it is known when Act2 was defined, all that needs to be done to create V3 from V2 is to copy everything added after Act2 was defined. Copy and w3 are defined in such a way that everything in V2 that also should be in V3 is copied. Since w3 specifies the same reference points as w1, new data added to V2 is copied to V3 in the same system state transition.
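The alignment requirement and the derived window sequence can be sketched as follows. The `WindowSequence` class, the integer time axis and the helper names are assumptions made for this illustration; reference points are taken to be epoch, epoch + delta, epoch + 2·delta and so on.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WindowSequence:
    epoch: int    # first reference point
    delta: int    # distance between consecutive reference points
    offset: int   # distance between reference point and window end
    ws: int       # window size

def reference_points(w, until):
    """All reference points of w up to and including `until`."""
    return set(range(w.epoch, until + 1, w.delta))

def epochs_align(w1, w2):
    """Precondition: there is an n > 0 with epoch(w2) = epoch(w1) + n * delta(w1)."""
    diff = w2.epoch - w1.epoch
    return diff > 0 and diff % w1.delta == 0

def copy_window(w2):
    """The window sequence w3 introduced by the rule."""
    return WindowSequence(epoch=w2.epoch, delta=w2.delta, offset=0, ws=w2.delta)

w1 = WindowSequence(epoch=0, delta=5, offset=0, ws=10)
w2 = WindowSequence(epoch=15, delta=5, offset=0, ws=10)
w3 = copy_window(w2)
```

With these example values the reference points of w2 form a proper subset of those of w1, and w3 fires at exactly the reference points of w2 with a window that reaches back one delta.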

Formal Proof. For a transformation to be correct, the workflow before application must be equivalent to the workflow after application. Because equivalent workflows have equivalent views [Definition ], it is sufficient to prove that V2 ≡ Vcopy in a workflow that combines the lhs and rhs of the transformation as shown in Figure . .

In this proof the assumption is made that all activities have been defined before the start of the system with different epochs and that the transformation has been applied but without removing the replaced activities/views. This variation from the production scenario does not affect the validity of the proof because the epochs cause the activities to behave as if they have not yet been defined. The workflow system runtime treats an activity with a future epoch as non-existent. (Of course optimization can already be applied to these activities.)

[Figure: combined workflow containing the activities Act1, Act2 and Copy, the views V1, V2 and Vcopy, and the window sequences w1, w2 and wcopy.]

Figure . : Copycat Rule LHS and RHS Combined

Equivalence of views is defined as ‘having the same content at all points in time’ [Definition ]. Therefore the equivalence V2 ≡ Vcopy is shown with a proof by induction, using the time t as induction variable.

Σ_t represents the state of the system at time t before any activity enabled at t is executed. Σ_{t+1} represents the state of the system after all activities enabled at t have been executed. For intermediate states the notation Σ_{t′} is used.

Given

1. The combined workflow of both sides of the transformation. See Figure . .

2. The preconditions for the transformation:

a) Act1 ≡ Act2
b) w1 and w2 are both time-based
c) ∃n > 0 : epoch(w2) = epoch(w1) + n ∗ delta(w1)
d) (implied by above: epoch(w1) < epoch(w2))
e) delta(w1) = delta(w2)
f) offset(w1) = offset(w2)
g) ws(w1) = ws(w2)

3. The window wcopy as defined by the transformation:

a) epoch(wcopy) = epoch(w2)
b) delta(wcopy) = delta(w2)
c) offset(wcopy) = 0
d) ws(wcopy) = delta(w2) (select all since the previous execution of Act1)

4. Copy copies the valid on attribute of the input tuples verbatim.


Basic Step. At t = 0, all views in the system Σ_0 are empty. Since empty views have identical content, the conclusion is that V2 =_0 Vcopy, by definition of an empty system [Assumption III].

Induction Step. According to the induction hypothesis, in system state Σ_t, it is true that V2 =_t Vcopy. In the next system state Σ_{t+1}, it must be true that V2 =_{t+1} Vcopy.

There are four cases to consider. t either aligns with all window sequences or does not align with any. If t aligns, then it falls into one of three sections of a timeline partitioned by epoch(w1) and epoch(w2) as illustrated in Figure . . The first execution of an activity is no different from any other execution of that activity. Therefore t = epoch(w1) and t = epoch(w2) are not separate cases.

[Figure: a timeline partitioned into three sections by epoch(w1) and epoch(w2).]

Figure . : Partitioned Timeline

The first thing to prove is that either all or none of the activities align with t. As a result, no cases have to be considered in which some activities do align and some activities do not align.

Because epoch(wcopy) = epoch(w2) [ a] and delta(wcopy) = delta(w2) [ b], Copy aligns trivially with Act2 by definition of alignment. Because delta(w1) = delta(w2) [ e] and [ c], Act1 and Act2 also align (in fact [ e]+[ c] is the definition of alignment, see Definition ). Alignment is an equivalence relation (thus transitive), therefore all activities align with each other, and therefore either all activities align with t or none of them do.

Let us consider the first case:

• No alignment.
t is not a reference point of any of the activities in the workflow. None of the windows align with t, so none of the activities is enabled for t, no activity is executed and there are no changes to the views, and thus Σ_{t+1} = Σ_t ⇒ V2 =_{t+1} Vcopy.


In the following cases, the windows of all activities align with each other and with t. Because t is a valid reference point of at least one of the activities in the workflow system, it is called Now. This switch of terminology is made to relate more closely to the notation used to specify workflows.

• Before the first time Act1 is enabled.
Now < epoch(w1) ⇒ Now < epoch(w2) ∧ Now < epoch(wcopy)
None of the activities is executed and thus Σ_{t+1} = Σ_t ⇒ V2 =_{t+1} Vcopy [ d, a]

• Starting the first time Act1 was enabled but before the first time Act2 is enabled.
Now ≥ epoch(w1) ∧ Now < epoch(w2)

– Act1 is executed and adds n ≥ 0 tuples to V1

– Act2 is not executed so V2 does not change

– Copy is not executed because epoch(wcopy) = epoch(w2) [ a] so Vcopy does not change

– V2 =_t Vcopy and nothing changes, thus V2 =_{t+1} Vcopy

• Starting the first time Act2 is enabled.
Now ≥ epoch(w2)

– Because of the input dependency relation between Act1 and Copy, the workflow system will ensure that Copy is always executed after Act1 for the same value of Now [Assumption II]. There are three possible execution orders: (Act1, Act2, Copy), (Act1, Copy, Act2) and (Act2, Act1, Copy). In this proof the first order is assumed. The other two orders can be proven similarly to result in V2 =_{t+1} Vcopy.

– Act1 is executed because epoch(w1) < epoch(w2) [ d] and adds zero or more tuples to V1, producing Σ_{t′} = Σ_t ∪ output(Act1)

– Act2 is executed and adds n tuples to V2, identical to the tuples Act1 has added to V1 because Act2 ≡ Act1 ∧ w1 ≡_Now w2 ∧ V1 = V1 (same operation, same selection, same source thus same output), producing Σ_{t′′} = Σ_{t′} ∪ output(Act2) [ a, efg, ]


[Figure: Before transformation (lhs): V1 is read through w by Min, which produces V2. After transformation (rhs): V1 is read through w by CA, which produces V3; V3 is read through w2 by πMin, which produces V4.]

Definition of w2:
epoch(w2) = epoch(w)
delta(w2) = delta(w)
offset(w2) = 0
ws(w2) = delta(w)

Figure . : Shared Loop Rule


– Now V2 has more tuples than Vcopy; but execution takes zero time [Assumption I], so this is not a point in time for the purpose of view equivalence.

– Copy is executed because epoch(wcopy) = epoch(w2) [ a]. Execution takes place after Act1 has added n tuples to V1. The window wcopy selects exactly the n tuples just added to V1. Copy copies n tuples to Vcopy, producing Σ_{t+1} since all activities have been executed.

– In Σ_{t+1} n tuples have been added to both V2 and Vcopy. These tuples are equivalent because (indirectly) they have been produced by equivalent activities from the same input data. (Thus Copy must be implemented so that it literally copies the valid on value from the input tuples [ ].)

– Thus V2 =_{t+1} Vcopy

Conclusion of Proof. For all t it has been proved that V2 =_t Vcopy, thus by definition of view equivalence V2 ≡ Vcopy in the combined workflow in Figure . .

Since the other views have not been redefined, every view on the lhs of the rule has an equivalent on the rhs of the rule and vice versa. The rule preserves strict equivalence and thus the Copycat Rule is a correct transformation of a streaming workflow. □
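The induction argument can also be exercised numerically. The sketch below simulates the combined workflow under several assumptions that are not taken from the thesis: time is an integer, a sensor appends one tuple per time step, the activity is a sum over its window, a window at reference point r with offset 0 selects the half-open interval (r − ws, r], and the view names `source`, `v_act1`, `v2` and `v_copy` are made up. It compares the output of Act2 with the output of Copy after every time step.

```python
def enabled(epoch, delta, t):
    """True when t is a reference point of a window sequence with this epoch and delta."""
    return t >= epoch and (t - epoch) % delta == 0

def select(view, hi, size):
    """Tuples whose valid_on lies in the half-open interval (hi - size, hi]."""
    return [(ts, val) for ts, val in view if hi - size < ts <= hi]

def simulate(steps, epoch1=0, epoch2=15, delta=5, ws=10):
    """Return True when the Act2 output equals the Copy output after every step."""
    source, v_act1, v2, v_copy = [], [], [], []
    for t in range(steps):
        source.append((t, t * t))                     # sensor reading at time t
        if enabled(epoch1, delta, t):                 # Act1 with w1
            v_act1.append((t, sum(v for _, v in select(source, t, ws))))
        if enabled(epoch2, delta, t):                 # Act2 with w2
            v2.append((t, sum(v for _, v in select(source, t, ws))))
        if enabled(epoch2, delta, t):                 # Copy with wcopy: offset 0, ws = delta
            v_copy.extend(select(v_act1, t, delta))   # copies tuples, valid_on included
        if v2 != v_copy:                              # compared only after all activities ran
            return False
    return True

aligned = simulate(40)                  # epoch(w2) = epoch(w1) + 3 * delta
misaligned = simulate(40, epoch2=17)    # violates the alignment precondition
```

With aligned epochs the two views stay identical at every step, while an epoch that breaks the alignment precondition makes them diverge, which is why the precondition is required.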

. Shared Loop Rule

This rule uses a very simple observation: if you need both the Min and Max of a set of data, you can find them both by iterating over the data only once. A Common Aggregate (CA) activity does exactly that: it computes various common aggregates like Min, Max and Count over a set of data. To present the user with the view he expected, the rule adds an extra Project (π) activity to select only the requested aggregate.
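A minimal sketch of this observation follows. The function names and the dictionary used as an output record are assumptions for illustration; the point is only that one pass over the window yields all aggregates and that a projection then restores the single attribute the user asked for.

```python
def common_aggregate(values):
    """Compute Min, Max and Count in a single pass over `values`."""
    lo = hi = None
    n = 0
    for v in values:
        lo = v if lo is None or v < lo else lo
        hi = v if hi is None or v > hi else hi
        n += 1
    return {"min": lo, "max": hi, "count": n}

def project(record, attribute):
    """Keep only the requested aggregate, as the Project activity does."""
    return {attribute: record[attribute]}

window = [4, 9, 2, 7]
shared = common_aggregate(iter(window))   # an iterator can be traversed only once
min_view = project(shared, "min")
max_view = project(shared, "max")
```

Both projected views are served from the same shared record, so the data is traversed once regardless of how many aggregates are requested.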

The validity of this rule is proved by treating it as the Copycat Rule limited to a subset of schema attributes. A Common Aggregate activity is the aggregate it replaces, except that it outputs more attributes, and Project is Copy, except that it outputs fewer attributes.

Contrary to the Copycat Rule, the Shared Loop Rule does not preserve strict equivalence. Because the view produced by the Common Aggregate contains more attributes, the rule only preserves weak equivalence.


[Figure: Before transformation (lhs): V2 is read through w3 by Copy, which produces V3; V3 is read through w4 by Consumer. After transformation (rhs): V2 is read through w4 by Consumer.]

Preconditions:
w4 ⊆ w3
Consumer can be any activity.

Figure . : Copy Elimination Rule

[Figure: After transformation (rhs): V2 is read through w3 by Copy, which produces V3; Consumer reads V2 directly through w4.]

Figure . : Copy Bypass Rule (Righthand Side)

. Copy Elimination/Bypass Rule

This section shows how a rule that only preserves weak equivalence can be adapted to preserve strict equivalence. First the weak variant is explained.

The Copycat Rule adds a Copy activity to a workflow. Often the filtering performed by the Copy activity is redundant because the consumer of the Copy activity has a window sequence that already does that. This will be the case when the user has defined a long workflow with multiple processing steps.

Figure . shows the Copy Elimination Rule that takes advantage of that situation. The statement w4 ⊆ w3 in the preconditions means that the reference points of w4 are a subsequence of the reference points of w3 (thus fewer windows), or that each window of w4 selects a subset of the corresponding window of w3 (smaller windows); it can also mean both. Because users will often define workflows with multiple processing steps on some data (like a pipeline), situations where w3 = w4 are expected to be fairly common.

The Copy Elimination Rule not only eliminates an activity, it also eliminates the corresponding view. Therefore the rule only preserves weak equivalence between the lhs and rhs, not strict equivalence. This is no problem, assuming that the user is not interested in the intermediate results.

Now suppose the user's interests have changed. To debug or inspect his workflow, the user wants to see the intermediate views. The Streaming Workflow System can no longer apply the Copy Elimination Rule because it breaks equivalence when V3 is included in the subset over which equivalence is defined.

Figure . shows an alternative transformation of the workflow that exploits the same idea of removing redundant filtering. This rule only bypasses the Copy activity and is hence called the Copy Bypass Rule. Because only an input relation is rerouted, the transformation preserves strict equivalence: for each view on either side of the transformation an equivalent view exists on the other side, namely the exact same view.

a. Workflows as specified and seen by the user(s):
Vin → σp (we) → Min (w1) → Vout
Vin → σp (we) → Max (w2) → Vout′

b. In the system runtime (both branches read the same view Vin):
Vin → σp (we) → Min (w1) → Vout
Vin → σp (we) → Max (w2) → Vout′

c. Apply Zip Rule (a single σp feeds both aggregates):
Vin → σp (we) → Min (w1) → Vout
Vin → σp (we) → Max (w2) → Vout′

Figure . : Example of a Multi-step Workflow Transformation

The precondition w4 ⊆ w3 implies alignment between the window sequences; therefore the validity of the Elimination and the Bypass Rule can be proven similarly to the Copycat Rule, with induction over time.

. A Multi-Step Example

To illustrate how workflow transformation works, this section presents a bigger example that includes all rules described in this chapter. The workflows are shown in Figures . and . .

In the example two workflows have been specified. One of them calculates the seven day sliding minimum value over a subset of measurements that satisfy a certain predicate. The other calculates the seven day sliding maximum over the same subset, although this workflow has been defined to start processing two weeks later than the first.

a. To start with, the workflows as specified and seen by the users of the Streaming Workflow System are shown. The two workflows are completely separate from the user's point of view even though they share the same view as data source. Because the workflows have been defined two weeks apart, the Min and Max activities have different window sequences.

b. If the two workflows were specified by the same user, he could have created a single workflow with multiple output views instead. Regardless of how the user(s) has/have specified the workflows, within the workflow system's runtime they are one workflow because they share a view.

c. The user did not care about when the selection is actually performed and has therefore used the standard event-based window sequence we, so the first transformation that can be applied is the Zip Rule. Both branches of the workflow perform the same selection on the same set of data using the same window sequence. If the user had specified different window sequences, the system would have been able to use the Copycat Rule on the selection activities instead.

d. Next, the Shared Loop Rule is applied to both branches of the workflow individually. This results in two additional Project activities being added to filter out the requested aggregate values. The user did not require delayed processing, so offset(w1) = offset(w2) = 0, and as a consequence the window sequences of the new projection activities are the same as for the aggregation activities.

d. Apply Shared Loop Rule to Both Branches:
Vin → σp (we) → CA (w1) → πMin (w1) → Vout
Vin → σp (we) → CA (w2) → πMax (w2) → Vout′
(a single σp feeds both CA activities)

e. Apply Copycat Rule:
Vin → σp (we) → CA (w1) → πMin (w1) → Vout
CA output → Copy (w2) → πMax (w2) → Vout′

f. Apply Copy Elimination Rule:
Vin → σp (we) → CA (w1) → πMin (w1) → Vout
CA output → πMax (w2) → Vout′

Figure . : Example of a Multi-step Workflow Transformation (cont.)

e. Because the only difference between w1 and w2 is the epoch, the two Common Aggregate activities are an easy match for the Copycat Rule. One aggregate is removed and replaced by a Copy activity.

f. Given that w2 ⊆ w2, the preconditions for the Copy Elimination Rule are fulfilled and the Copy activity can be removed.

The workflows in a. and f. are weakly equivalent. The view produced by the Common Aggregate has no equivalent in the original workflow and must therefore be excluded from the equivalence. Weak equivalence is acceptable here because this is exactly the asymmetric situation as explained in Section . .

. Monitoring a River Slightly Differently

Reconsider the streaming workflow from Figure . . A scientist has devised a method to detect groundwater influxes in dikes along a river. He uses a streaming workflow to filter and aggregate the sensor data before feeding it into the Influx Detector activity.

The scientist has concluded that he needs to filter out more measurements because the temperature sensors generate more noise than expected. He changes the activity into a activity and submits the new workflow specification to the system. Using the Zip Rule, the system can share the 'flow' branch of the workflow between the original and the revised workflow.

In another update the scientist adds Avg, Min and Max activities to the 'flow' branch because they are needed by his new Influx Detection algorithm. Again the system uses the Zip Rule to reuse the unchanged parts of the workflow. Additionally the system applies the Shared Loop Rule, so the extra aggregates have virtually no extra processing costs.

The story above assumes that the scientist wants the system to reprocess the data already processed by the old workflow. If he had specified that only new data has to be processed by the new workflow, the system could have applied the Copycat Rule to obtain basically the same sharing as achieved through the Zip Rule.

Chapter Five

Tuple-based Transformation Rules

Many activities manipulate views in terms of individual tuples. Common characteristics of these activities are: 1. they operate at the tuple level, 2. time-based selection is meaningless because of this and 3. they are used to manage data and have no intrinsic purpose in the workflow.

In the 'Monitoring a River' example in Figure . tuple-based activities have been left out for clarity, although the workflow implicitly contains a Union to collect all sensor data from a specific type of sensor before filtering the data. If the scientist wanted to limit his research to a smaller section of the river, he would add Select activities to the workflow. The usage of both an implicit Union activity and optional Select activities illustrates that tuple-based activities are for data management and have no intrinsic purpose.

Many operators in relational algebra can be classified as tuple-based activities. The most notable exception is the Join; a Join does not quite match the characteristics above, setting it apart from the other activities discussed here. The Join activity is not only discussed in Section . but also in the next chapter.

This chapter starts with two rules that come directly from relational algebra.

. Union Associativity Rule

A common activity is Union. It copies all tuples from any input view to the output view, preserving all tuples and all attributes. The output view is sorted on time. Contrary to the union operator in relational algebra, a Union activity does not have to perform duplicate elimination as all tuples are unique by source and timestamp.

A well-known property of the union operator is associativity, which means that (A ∪ B) ∪ C = A ∪ (B ∪ C).
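As a small illustration of associativity for a time-sorted Union, the following Python sketch merges three views in different groupings. The tuple layout (timestamp, source, value) and the helper name union are assumptions made for this example only; window sequences are left out.

```python
import heapq
from typing import Iterable, List, Tuple

# Assumed tuple layout for illustration: (timestamp, source, value).
Row = Tuple[int, str, float]


def union(*views: Iterable[Row]) -> List[Row]:
    """Union sketch: merge time-sorted views without duplicate elimination."""
    return list(heapq.merge(*views))


v1 = [(1, "s1", 0.5), (4, "s1", 0.7)]
v2 = [(2, "s2", 1.1), (3, "s2", 1.3)]
v3 = [(1, "s3", 2.0), (5, "s3", 2.2)]

flat = union(v1, v2, v3)           # one Union with three inputs
left = union(union(v1, v2), v3)    # (V1 ∪ V2) ∪ V3
right = union(v1, union(v2, v3))   # V1 ∪ (V2 ∪ V3)
print(flat == left == right)
```

Because every tuple is unique by source and timestamp, the merged order is fully determined and the three groupings yield the same output view.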

Before transformation (lhs): V1, V2, V3 → Union (we on each input) → V4

After transformation (rhs): V1, V2 → Union (we on each input) → intermediate view; intermediate view, V3 → Union (we on each input) → V4

Figure . : Union Associativity Rule

Before transformation (lhs): Vin1, Vin2 → Joinpred (wjoin on each input) → Vjoin → σp (we) → Vout

After transformation (rhs): Vin1 → σp (we) → Vin1′; Vin1′, Vin2 → Joinpred (wjoin on each input) → Vout

Preconditions:
atoms(p) ⊆ schema(Vin1)
atoms(p) ∩ schema(Vin2) = ∅

Figure . : Selection Pushdown Rule

Before transformation (lhs): V1, ..., Vn → Act (wt on each input) → Vout

After transformation (rhs): V1, ..., Vn → Act (we on each input) → Vi → Copy (wt) → Vout

Preconditions:
wt is time- or tuple-based
Act is a tuple-based activity
Act has ≥ 1 inputs

Figure . : Masquerade Rule

In Figure . a transformation rule is shown for the case of three input views. Because of the fresh tuple semantics of a Union and the we window sequence used, the activities behave identically to the relational operator and the associativity known from relational algebra can be directly used.

The usefulness of this rule is illustrated with the following scenario. In a sensor deployment, sensors are constantly being added and removed. Assume a workflow is defined with a union of V1 and V2. Remember that workflows are immutable [Design Choice III]. Now a new sensor is deployed and a new workflow is added to the system with a union of V1, V2 and V3. Using the Union Associativity Rule, the system only has to add the tuples from V3 to an existing Union. The Union Associativity Rule is useful when considering many slightly different workflows over time because workflows are immutable.

. Selection Pushdown Rule

The rule in Figure . shows another well-known principle from relational algebra: selection can be pushed through joins. The benefit of doing this is a reduction of the number of tuples that have to be processed by the expensive Join activity. In [ ] it is observed that many continuous querying systems do the opposite: they push down the Join to increase the potential for sharing. This rule can be used both ways.

In [ ] it is argued that both pulling up and pushing down are not optimal and another approach is presented that prevents unnecessary work (which they call zombie tuples) but also prevents the same operation being applied to a tuple twice.
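The pushdown principle itself can be checked with a short Python sketch. It ignores window sequences entirely and uses hypothetical helpers (join, select) and made-up attribute names (t1, flow, t2, temp); the only property carried over from the rule is that the selection predicate p refers to attributes of the first input only.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]


def join(left: List[Row], right: List[Row],
         pred: Callable[[Row, Row], bool]) -> List[Row]:
    """Nested-loop join; output rows carry the attributes of both inputs."""
    return [{**l, **r} for l in left for r in right if pred(l, r)]


def select(rows: List[Row], p: Callable[[Row], bool]) -> List[Row]:
    """Select (σ): keep the rows that satisfy p."""
    return [row for row in rows if p(row)]


vin1 = [{"t1": 1, "flow": 3.0}, {"t1": 2, "flow": 8.0}, {"t1": 3, "flow": 9.5}]
vin2 = [{"t2": 1, "temp": 11.0}, {"t2": 3, "temp": 12.5}]

pred = lambda l, r: l["t1"] == r["t2"]  # join predicate
p = lambda row: row["flow"] > 5.0       # atoms(p) ⊆ schema(Vin1)

lhs = select(join(vin1, vin2, pred), p)  # selection after the join
rhs = join(select(vin1, p), vin2, pred)  # selection pushed below the join
print(lhs == rhs)
```

In the pushed-down variant the join sees two rows of Vin1 instead of three, which is the reduction in work that motivates the rule.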

. Masquerade Rule

Most tuple-based activities use the fresh tuple semantics [Definition ], which in combination with the standard tuple-based window sequence we [Definition ] results in behavior very similar to their namesakes in relational algebra. The previous two rules only transform workflows that adhere to this standard.

Sometimes, however, these activities do not use the standard tuple-based window sequence we, because the user has specified another window sequence. Although the usefulness of other window sequences is unclear, it is clear that they can be specified in the formal model. Therefore, transformation rules have to take them into account. Doing so on a per-rule basis would lead to unnecessary rules and complexity within rules.

Before transformation (lhs): V1 → Union (we) → V2

After transformation (rhs): V1

Figure . : Single Input Union Rule

Before transformation (lhs): V1 → Act (w) → V2 → Act (w) → V3

After transformation (rhs): V1 → Act (w) → V3

Preconditions: Act is idempotent when combined with w

Figure . : Idempotent Activities Rule

The Masquerade Rule as shown in Figure . converts an inherently tuple-based activity into an activity with the standard window sequence we. To hide the difference (in particular the update frequency), an additional Copy activity with the original window sequence is added to the workflow. After application of the Masquerade Rule, tuple-based activities can be transformed using transformation rules directly derived from relational algebra.

When the Copy Elimination Rule can be applied after application of the Masquerade Rule, the introduced overhead is eliminated. This will often be the case, as argued in Section . . In this way the system eliminates an arbitrary restriction defined by the user, increasing the performance of the individual workflow and the shareability of the intermediary results.

Because an intermediary view is introduced, the Masquerade Rule preserves weak equivalence but not strict equivalence. A proof of this equivalence would use induction over time to show that the introduced view Vi is always 'ahead' of the view Vout and that the Copy activity copies the tuples at the correct time to Vout. To prove that the content of the tuples is correct, consider that Act is tuple-based and that Copy does not change the contents of the tuples.

Other Uses Similarly to the Copycat Rule, only fresh tuples are selected and copied by the introduced Copy activity because the window sequence is partitioning. This way, the Masquerade Rule also handles situations where a Union is defined with a sliding window. The Union activity is fresh tuple based so this does not have any performance impact, but it makes the Union more shareable because after application of the rule it is only unique in the combination of its offset and input relations.

Another alternate use of the Masquerade Rule is when a hopping window is defined for a Union activity. The hopping causes the Union to act like a timeline selection. This selection can be done by a separate activity, increasing the potential for sharing the Union.

Before transformation (lhs): V1 → σp (we) → V2 → σq (we) → V3

After transformation (rhs): V1 → σp∧q (we) → V3

Figure . : Collapse/Split Select Rule

Before transformation (lhs): V1 → πa (we) → Va; V1 → πb (we) → Vb

After transformation (rhs): V1 → πa∩b (we) → Vab; Vab → πa (we) → Va; Vab → πb (we) → Vb

Figure . : Intermediate Project Rule

. Other Rules

Given the Masquerade Rule, many more rules known from relational algebra can be directly mapped to workflow transformations. This section gives four examples. All these rules assume the Masquerade Rule has been applied when necessary to transform the workflow to only use the we window sequence.

Single Input Union Rule There is nothing (in the formal model) preventing the user from creating a Union activity with only a single input relation. Also, the system may be able to deduce that some input view will always be empty and can therefore remove an input relation, resulting in a Union with one input. In that case, the Union effectively does nothing and can be removed with the rule shown in Figure . .

Idempotent Activities Rule In some cases an activity does nothing when applied twice because it is idempotent. Figure . shows a rule that eliminates one of the activities in such a case.

Select and Project are idempotent combined with any window sequence because these activities follow the fresh tuple semantics. Copy in combination with a partitioning window sequence is also idempotent.

Time-based activities can also be idempotent. Any distributive aggregate over a certain time interval that is its own super aggregate is idempotent. An aggregate over a tuple-based interval is not idempotent, because a tuple interval has no universal meaning.

Collapse and Split Select Rule Selection is commutative, therefore multiple Select activities can be combined, split up and (by repeated application of the rule) reordered. Figure . shows the rule; which side of the transformation is the before state and which side the after state is rather arbitrary.

Intermediate Project Rule The rule in Figure . allows the system to insert an extra intermediate Project activity. This could be done to minimize the size of tuples to be stored after application of the Shared Loop Rule (see Section . ), by eliminating attributes that are never going to be used.
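The Collapse and Split Select Rule described above relies on selection being commutative. A minimal Python sketch of that property, with hypothetical helpers select and collapse and with views reduced to plain lists of values:

```python
from typing import Callable, List

Predicate = Callable[[float], bool]


def select(rows: List[float], p: Predicate) -> List[float]:
    """Select (σ): keep the rows that satisfy p, preserving order."""
    return [r for r in rows if p(r)]


def collapse(p: Predicate, q: Predicate) -> Predicate:
    """Build the combined predicate p ∧ q."""
    return lambda r: p(r) and q(r)


rows = [1.0, 4.0, 6.5, 9.0, 12.0]
p = lambda r: r > 3.0
q = lambda r: r < 10.0

two_steps = select(select(rows, p), q)   # σq after σp
reordered = select(select(rows, q), p)   # σp after σq
one_step = select(rows, collapse(p, q))  # single σ with p ∧ q
print(two_steps == reordered == one_step)
```

Reading the equality from right to left gives the split direction of the rule.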

Chapter Six

Advanced Transformation Rules

In this chapter transformation rules for two important activity types will be discussed: aggregates and joins. Much research has been done on the optimization of aggregates and joins in the context of streaming query processing. The rules in this chapter will demonstrate that the formal model can handle these common and important activity types.

. Aggregate Split Rule

Much research has been done to optimize aggregates over sliding windows in streaming query processing [ , ]. The 'Monitoring a River' example also contains plenty of aggregates: the majority of activities in Figure . are aggregates (sensors do not count in this case since they do not process data).

A common optimization strategy is that for many aggregates the value over a big window can be computed from the values of smaller windows. The Aggregate Split Rule uses this idea to optimize a certain subclass of aggregates: distributive aggregates.

Definition : Distributive Aggregate An aggregate function f() is distributive iff there exists a function g() such that f(A ∪ B) can be computed using g({f(A), f(B)}). Examples of distributive aggregate functions are count, sum, min and max. For count, g = sum; for the other examples g = f. (This definition is taken from [ ])

The function g is referred to as the SuperAggregate in the transformation rule. In the workflow model a view with the intermediary results of a distributive aggregate has the same relational schema as the view with the end results. This is not true for other types of aggregates.
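The definition can be checked on a small example. The Python sketch below is only an illustration: A ∪ B is modelled as the concatenation of two disjoint partitions, and the table DISTRIBUTIVE and the helper is_consistent are hypothetical names introduced here.

```python
from statistics import mean

# (f, g) pairs: f is the aggregate, g its super aggregate.
DISTRIBUTIVE = {
    "count": (len, sum),
    "sum": (sum, sum),
    "min": (min, min),
    "max": (max, max),
}


def is_consistent(f, g, a, b):
    """Check f(A ∪ B) == g({f(A), f(B)}) for two disjoint partitions."""
    return f(a + b) == g([f(a), f(b)])


a = [3, 8, 1]
b = [7, 2]

for name, (f, g) in DISTRIBUTIVE.items():
    print(name, is_consistent(f, g, a, b))

# avg is not its own super aggregate: the mean of the partial means is
# wrong as soon as the partitions differ in size.
print("avg", is_consistent(mean, mean, a, b))

# Carrying (sum, count) as the intermediate result repairs this, at the
# price of an intermediate schema that differs from the end result.
partials = [(sum(a), len(a)), (sum(b), len(b))]
total, n = map(sum, zip(*partials))
print(total / n == mean(a + b))
```

The last step shows why an intermediate view for avg needs both sum and count, while the four distributive aggregates can reuse the schema of their end result.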

An avg aggregate, common as it is, is not distributive because the intermediate results need to contain both sum and count. The Common Aggregate activity introduced in the Shared Loop Rule in Section . is distributive and can be defined to include avg.

Before transformation (lhs): V1 → Aggregate (w) → V2

After transformation (rhs): V1 → Aggregate (w1) → V3 → SuperAggregate (w) → V2

Preconditions:
w is time-based and sliding.
Aggregate is distributive.
∃n ∈ N1 : ws = n ∗ delta

Definition of w1:
epoch(w1) = epoch(w)
delta(w1) = delta(w)/n
offset(w1) = offset(w)
ws(w1) = ws(w)/n

Figure . : Aggregate Split Rule

Observations Application of the Aggregate Split Rule will optimize a workflow based on the following observations.

• In a sliding window sequence, each window overlaps at least two other windows. Given the semantics of a distributive aggregate it is possible to compute the value of an aggregate from the values of smaller aggregates [Definition ]. These smaller aggregates can be used in the computation of multiple bigger aggregates when they are stored in a view. This sets the rule apart from other rules: it enables sharing within a single workflow.

• Many users will look at the same data at various levels of aggregation (drilling up and down). So data is often needed at multiple aggregation levels.

• Creating intermediate results from smaller windows increases the chance of intermediate results being shareable: in the rhs in Figure . there is a possibility that a view equivalent to V3 is already available in the system; this effectively eliminates added costs when a new workflow is added.

The Rule Figure . shows the rule. The rule is very simple and effectively only rewrites Definition using the syntax of the formal model.

Cost Reduction This transformation rule reduces the total amount of work the workflow system must do because cost(Aggregate, w1) + cost(SuperAggregate, w) is less than cost(Aggregate, w) under the assumption that size(V3) ≪ size(V1). Assuming that the cost of executing the SuperAggregate is very low compared to the Aggregate, the total cost is reduced to 1/n of the original cost, where n is the number of partitions into which the original window is split.

An example is given in Figure . in which the computational cost of activation is shown for each window. If V1 contains one tuple per minute and w is defined as 'last two hours, update every hour' then the Aggregate has to process 120 tuples every hour. After applying the rule, the Aggregate has to process 60 tuples every hour and the SuperAggregate has to process only the 2 tuples produced in the last two hours by the Aggregate. Thus the cost of execution has dropped from 120 to 60 + 2 = 62, assuming the number of tuples is a reasonable cost indicator.

Before transformation (lhs): windows of w over view V1, with the cost per window.

After transformation (rhs): windows of w1 over view V1 and windows of w over view V3, with the cost per window.

Figure . : Computational Costs in the Aggregate Split Rule

This reduction of computational cost clearly does not apply to partitioning and hopping window sequences since there is no overlap to exploit. A different optimization criterion for a Streaming Workflow System is spreading system load. When the Streaming Workflow System is optimizing for system load, the Aggregate Split Rule can be used on partitioning and hopping window sequences to break up large chunks of work so system load can be spread more evenly.
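The 'last two hours, update every hour' example can be replayed with a short Python sketch that counts the tuples read per activation. This is a simplification under stated assumptions: Max stands in for the distributive aggregate, views are plain lists, the smaller Aggregate is taken to run once per hour over the newest hour (as in the prose example), and the function names cost_direct and cost_split are hypothetical.

```python
MINUTES_PER_HOUR = 60


def cost_direct(v1, hour):
    """lhs: one Aggregate (max) over the last two hours, run every hour."""
    window = v1[(hour - 2) * MINUTES_PER_HOUR: hour * MINUTES_PER_HOUR]
    return max(window), len(window)


def cost_split(v1, v3, hour):
    """rhs: Aggregate over the newest hour into V3, then SuperAggregate."""
    newest = v1[(hour - 1) * MINUTES_PER_HOUR: hour * MINUTES_PER_HOUR]
    v3.append(max(newest))  # partial result stored in the intermediate view
    partials = v3[-2:]      # the two partials covering the last two hours
    return max(partials), len(newest) + len(partials)


# Five hours of synthetic per-minute readings (arbitrary test data).
v1 = [(37 * minute) % 101 for minute in range(5 * MINUTES_PER_HOUR)]
v3 = [max(v1[:MINUTES_PER_HOUR])]  # partial for the first hour

results = []
for hour in range(2, 6):
    direct, tuples_lhs = cost_direct(v1, hour)
    split, tuples_rhs = cost_split(v1, v3, hour)
    results.append((direct == split, tuples_lhs, tuples_rhs))

print(results)
```

Each activation yields the same aggregate value on both sides, while the number of tuples read drops from 120 to 62, matching the arithmetic in the text.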

Repeated Application The SuperAggregate is a distributive aggregate itself, allowing the rule to be applied repeatedly, creating a hierarchy of aggregates along the time domain. Repeated application is limited by the fact that when the difference in size between V1 and V3 becomes too small, the assumption cost(Aggregate, w1) + cost(SuperAggregate, w) < cost(Aggregate, w) no longer holds.

Formal Model Limitations The cost reduction achieved by the Aggregate Split Rule is based on the assumption size(V3) ≪ size(V1). When the rule is applied repeatedly, the assumption breaks as observed in the previous paragraph.

Another case in which the assumption breaks is when a big window is frequently updated; for example: 'last week, update every minute'. As a week contains 10080 minutes, it is very likely that the assumption breaks. In 'Resource sharing in continuous sliding-window aggregates' [ ] this is avoided by using a hierarchy of aggregates. To compute an aggregate over a week-sized window, compute daily and hourly aggregates and compute per-minute aggregates. Next, to compute an aggregate over a 10080 minute window, start with the days that are fully contained and 'pad' the interval before and after those days with hours and minutes to complete the window.

This approach cannot be expressed in the formal model itself. The only way to implement this approach is to create an activity that implements this logic internally and selectively reads the required pieces of an aggregate from multiple input views.

An important concern in [ ] is the efficiency of lookups; the formal model does not facilitate reasoning about this.

. Shared Join Rule

Another relational operator that has attracted much attention in continuous query processing is the join operator. It is used to join together data from different sensors. In the 'Monitoring a River' example, flow measurements and temperature measurements have to be paired with each other within a time window because influxes can only be detected by looking at abnormal combinations of the two.

An observation made by [ ] is that when two joins have the same predicate, the output tuples of the join with the smallest window are contained in the output tuples of the join with the biggest window. The Shared Join Rule uses this observation to optimize Join activities in streaming workflows.

Before transformation (lhs): Vin1, Vin2 → Joinpred (w1 on each input) → Vout1; Vin1, Vin2 → Joinpred (w2 on each input) → Vout2

After transformation (rhs): Vin1, Vin2 → Joinpred (wmax on each input) → Vmax ≡ Vout1; Vmax → σδ≤ws(w2) (wmin) → Vout2

Preconditions:
w1 and w2 align.
ws(w1) = ws(w2)

Definitions:
δ = |valid on1 − valid on2| for each tuple
wmax = w1 if ws(w1) > ws(w2), w2 otherwise
wmin = w2 if ws(w1) > ws(w2), w1 otherwise

Figure . : Shared Join Rule

The Rule The Shared Join Rule is shown in Figure . . The biggest Join is used as starting point. In the output tuples of the Join, the valid on timestamps of both joined input tuples are preserved. Based on the distance between those two timestamps, a Select activity can filter out any tuple that would not have been generated by the Join with the smallest window. (Thus it is not possible to transform tuple-based Join activities!)

The rule preserves strict equivalence. For any view on either side of the transformation an equivalent view is present on the other side [Definition ]. Intuitively, this is true because there is clearly a subsumption relation between Vout1 and Vout2.
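The subsumption idea can be sketched in Python under a strong simplification: a time-windowed join is modelled as pairing readings whose timestamps lie at most ws apart, and window sequences, alignment and activation times are ignored. The helper names window_join and select_delta and the (valid_on, value) layout are assumptions made for this illustration.

```python
from typing import List, Tuple

Reading = Tuple[int, float]               # assumed layout: (valid_on, value)
Joined = Tuple[int, int, float, float]    # both timestamps are preserved


def window_join(left: List[Reading], right: List[Reading], ws: int) -> List[Joined]:
    """Pair readings whose valid_on timestamps lie at most ws apart."""
    return [(tl, tr, vl, vr)
            for tl, vl in left
            for tr, vr in right
            if abs(tl - tr) <= ws]


def select_delta(joined: List[Joined], ws: int) -> List[Joined]:
    """Select on δ = |valid_on1 − valid_on2| ≤ ws over the join output."""
    return [row for row in joined if abs(row[0] - row[1]) <= ws]


flow = [(0, 1.0), (5, 1.2), (10, 1.1)]
temp = [(1, 9.0), (7, 9.5), (12, 9.3)]

big, small = 6, 2
v_max = window_join(flow, temp, big)            # the biggest Join
v_small_shared = select_delta(v_max, small)     # filtered from its output
v_small_direct = window_join(flow, temp, small) # the smallest Join on its own
print(v_small_shared == v_small_direct, len(v_max), len(v_small_direct))
```

The filter only works because both timestamps survive in the join output, which is the reason the text excludes tuple-based Join activities.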

Formal Model Limitations For Join activities, the most natural window sequence would seem to be tuple-triggered, but with a time-based window size. This is at the moment not possible in the formal model.

Formal Model Limitations In [ ] various algorithms are presented that differ in performance characteristics. The Shared Join Rule represents the most naive and most unfair method. It is considered unfair because big window joins can delay small window joins when executed together.

An alternative algorithm is presented that gives precedence to the join with the smallest window. This algorithm produces the tuples for the smallest join first and later produces the additional tuples for the bigger join.

The formal model assumes that the workflow system is infinitely fast [Assumption I] and assumes sequential execution of dependent activities [Assumption II]. For a workflow on which the Shared Join Rule has been applied this means that the Join activity will always be completed before the filtering is started. Because every single activation of an activity is batch-based and activities cannot produce intermediate results, the fairer algorithm cannot be expressed in the formal model.

Chapter Seven

Related Work

This chapter presents related work on streaming workflows and continuous querying. In the first section some remarks are made on the complications of comparing the formal model with related work. The other sections discuss transformations and optimizations presented in related work.

. Complications

Comparing related work with the formal model can turn into an unbounded discussion for two reasons:

1. The Formal Model is Turing Complete The formal model presented in Chapter puts no limits on what an activity can do. As a consequence the formal model is Turing complete: any computational process can be expressed. Figure . takes this to the extreme and shows the ultimate streaming workflow.

Vinput → Do Anything You Want (wall) → Voutput

Figure . : The Ultimate Workflow.

On the one hand it is clear that the extreme case in Figure . is not desirable; on the other hand, treating activities as black boxes is a deliberate design choice of the formal model.

2. The Formal Model is an Intermediate As described in Section . and illustrated by Figure . on page , the formal model is not intended to be a user interface (∼ language) and the formal model is neither an execution plan nor an implementation blueprint. It is an intermediate abstraction intended to ease transformation and optimization.

Because the formal model is Turing complete, any approach or technique presented in related work can be mapped to the formal model. Because the formal model is an intermediate abstraction, adding an additional mapping (above or below in the system stack) is always possible.

The real question is in which cases this reasoning is acceptable.

. Query Graph Transformations

The formal model represents one or more queries as a big operator graph. The system can be optimized by applying transformation rules to the operator graph. Several such transformation rules are described in chapters , and ; those chapters are not intended to be complete.

Below are some transformations of the operator graph as described in related work. Most of them can easily be rewritten as transformation rules in the formal model.

Everything from Relational Algebra The STREAM system [ , ] adds streams as an extra type to standard relational processing. This is done in a way that preserves all the standard relational transformations. In the formal model, too, many existing relational transformations can be expressed thanks to the Masquerade Rule from Section . .

Filter-Window Commutativity A suggested transformation of the query graph in STREAM combines streams and relations: a selection operator is commutative with the filtering done by a window because a window is essentially a filter. This type of transformation could be easily described in the formal model.

Expression Signatures In NiagaraCQ [ ] queries are grouped based on the signature of subexpressions. For example, two select operators that differ only in a constant will be grouped and executed together. The output of the grouped operator is split using a table that records these constants.

Grouping of exactly identical subexpressions happens in the formal model through zipping (see Sections . and . ). Transformation rules can be used to extract a subexpression from a query that is identical to a subexpression from another query. Continuing the example of two select operators: define a transformation rule that creates the disjunction of multiple predicates as an identical subexpression and splits the results afterwards. In the formal model solution, the table used by NiagaraCQ to split the output is embedded in the query graph.
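The idea of sharing a disjunctive select and splitting afterwards can be sketched as follows. This is only an illustration under assumed names: `shared_select`, `split`, the query names and the `temp` attribute are hypothetical and do not come from NiagaraCQ or from the formal model.

```python
from typing import Callable, Dict, Iterable, List

Row = Dict[str, float]
Predicate = Callable[[Row], bool]


def shared_select(stream: Iterable[Row], predicates: Dict[str, Predicate]) -> List[Row]:
    """Shared subexpression: keep a tuple if ANY query's predicate accepts it."""
    return [row for row in stream if any(p(row) for p in predicates.values())]


def split(shared: Iterable[Row], predicates: Dict[str, Predicate]) -> Dict[str, List[Row]]:
    """Split the shared result into one output per query using additional selects."""
    rows = list(shared)
    return {name: [row for row in rows if p(row)] for name, p in predicates.items()}


# Two hypothetical queries that differ only in the constant of their select.
predicates: Dict[str, Predicate] = {
    "q_warm": lambda row: row["temp"] > 10.0,
    "q_hot": lambda row: row["temp"] > 20.0,
}
stream = [{"temp": t} for t in (5.0, 12.0, 25.0, 18.0, 30.0)]

# One shared pass over the input, followed by a per-query split.
outputs = split(shared_select(stream, predicates), predicates)
```

Each entry of `outputs` equals what the corresponding query would have produced on its own, while the input stream is scanned only once by the shared select.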

Inserting, Combining and Reordering The Aurora system [ ] uses the boxes and arrows paradigm, not only to represent the query, but also as an interface for the user. Aurora can apply various optimizations to the query graph: inserting additional projections to minimize storage and communication costs, combining sequential boxes into a single bigger box to eliminate overhead, and reordering boxes to push selection down.

These are transformations that can be easily described in the formal model. The first requires the knowledge of which attributes are used by an activity; for each type of (custom) activity this needs to be described in a separate rule.

. Adaptive Processing

An entirely different approach to streaming query processing is that of Eddies [ ], which have been implemented in the TelegraphCQ [ ] system. Instead of treating one or more queries as a big operator graph, the system associates lineage information with each tuple. Based on which operators a tuple has visited, which operators can be visited next and for which queries a tuple is dead, the Eddy routes each tuple individually through all required operators until it is complete for all queries.
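A much simplified sketch of per-tuple routing is given below. It assumes a single query made up of selection operators only, keeps the lineage as the set of operators a tuple has already visited, and uses an invented routing policy (prefer the operator with the lowest observed pass rate). None of the names reflect the actual Eddy or TelegraphCQ implementation.

```python
from typing import Callable, Dict, List

Row = Dict[str, float]
Operator = Callable[[Row], bool]


def eddy(rows: List[Row], operators: Dict[str, Operator]) -> List[Row]:
    """Route every tuple through all operators, choosing the order per tuple."""
    seen = {name: 0 for name in operators}    # tuples routed to each operator
    passed = {name: 0 for name in operators}  # tuples each operator let through
    complete: List[Row] = []
    for row in rows:
        visited = set()                       # lineage: operators already visited
        alive = True
        while alive and len(visited) < len(operators):
            pending = [name for name in operators if name not in visited]
            # Routing policy (an assumption): prefer the operator with the
            # lowest observed pass rate, smoothed to avoid division by zero.
            target = min(pending, key=lambda n: (passed[n] + 1) / (seen[n] + 1))
            visited.add(target)
            seen[target] += 1
            if operators[target](row):
                passed[target] += 1
            else:
                alive = False                 # the tuple is dead for this query
        if alive:
            complete.append(row)              # visited every operator and survived
    return complete


# Hypothetical operators and tuples.
operators: Dict[str, Operator] = {
    "warm": lambda row: row["temp"] > 10.0,
    "slow": lambda row: row["flow"] < 5.0,
}
rows = [
    {"temp": 5.0, "flow": 1.0},
    {"temp": 15.0, "flow": 1.0},
    {"temp": 15.0, "flow": 9.0},
    {"temp": 25.0, "flow": 3.0},
]
result = eddy(rows, operators)
```

Whatever order the routing policy picks, the result equals the conjunction of all selections; only the amount of work per tuple changes.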

SteMs [ ] are an extension of the idea of Eddies in combination with joins. A SteM operator embodies the internal state that a join operator needs to maintain for each input relation. By adding a few routing restrictions to the Eddy, a join can be computed by sending tuples through the correct SteM operators. SteMs break the encapsulation of a classic join operator, which has two advantages: the internal state of joins can be shared, and the Eddy can adapt better because the physical operations are more observable.

That storing the lineage of a tuple is not only useful for adaptive stream processing but can also be used to eliminate duplicate and useless processing is demonstrated in [ ]. By including the evaluation of predicates in the tuple's lineage, the authors prevent needless tuples from being generated by joins and prevent the same (expensive) predicate from being evaluated repeatedly.

Storing the lineage of a tuple would be no problem in the formal model since activities are free to choose the schema of their output tuples. Adaptive routing and SteMs, however, could be a feature of a Streaming Workflow System, but their semantics cannot be expressed in the formal model. Instead, they could be used as a way to implement joins. The formal model can then be used to group joins on the same physical machine.

A problem with SteMs is that views in the formal model are purely relational and cannot easily be used as a ‘scratchpad’ for activities.

. Optimization Criteria

Computational cost is just one criterion for which one or more queries can be optimized. There are other motives to transform a query plan. The formal model was intended to be agnostic of the optimization criterion, so this section details some criteria from related work.

Communication Cost The COSMOS [ ] system tries to minimize the communication cost in a distributed stream processing system by query rewriting and merging query results. Communication cost is a problem because data rates in stream processing can be very high and in a distributed system communication can be intercontinental.

Because the formal model does not support multiple output streams from a single activity, many transformations define an activity that creates a superset of the results of the transformed activity and split this superset using additional select activities. This makes it very easy to minimize communication costs in the formal model because a merged result stream is often already available.

In [ ] an algorithm is presented that minimizes data transmissions in sensor networks by locating operators at appropriate nodes in the network. The rationale for such optimization is presented in [ ], which claims that CPU use is cheaper (in terms of power consumption of embedded devices) than communication.

Pushing selection and other processing steps down to the leaves of the streaming workflow graph is not a problem to express in the formal model.

The SPADE [ ] system tries to optimize for minimum communication because communication is the main cause of latency in the system.

Fairness of Optimizations Although a transformation may lead to global optimization, for individual queries the effects can be negative. In Section . the Shared Join rule is described. In [ ] an approach similar to the Shared Join rule is presented; the authors consider various join algorithms that have different effects on the original queries. Their particular concern is the negative effect on the latency of the smaller of two joins when two joins are combined. After being combined, the query with the smaller join will produce results more slowly than originally because it has to wait for the bigger join.

These fairness considerations cannot easily be expressed in the formal model since it assumes an infinitely fast system (Assumption I). This may be an acceptable limitation when considering non-time-critical systems, where the reduction of the total amount of work is the biggest concern.

. Provenance

The formal model has a strong focus on exact and traceable results. This is due to the setting under which this thesis project started. In the example from Section . , the Streaming Workflow System is introduced as a scientific tool in an academic setting. Reproducible results are the cornerstone of sound academic research.

The thesis of Reinier-Jan de Lange [ ] on provenance in sensor networks has clearly contributed to the focus on exact and traceable results. This focus is most explicit in Design Choice III, which states that workflows must be immutable for the sake of provenance/governance.

In related work on streaming query processing, provenance does not seem to be a concern. Rapid and predictable response times are much more important and results do not have to be exact and reproducible.

. Approximation

Approximation is a very natural thing to do, as sensor data itself is an approximation of reality. Due to the focus on exact and reproducible results, the formal model has not considered approximations. But because the topic is so common in related work, a section here is still warranted.

Using approximations improves the performance of a system in several ways. In DataCube [ ] it is observed that many users do not require exact results and often use statistical approximations. These approximations can be easily maintained incrementally, which is not possible for the exact holistic aggregates. Making algorithms over sliding windows incremental is the focus of [ ].

A specific use of approximations is coping with peak load. In [ ] the TelegraphCQ [ ] system is extended with load shedding. Additional queues are inserted between data sources and the query. When the system cannot keep up, these queues overflow and tuples must be dropped. The dropped tuples are synopsized and a shadow query plan processes these synopses. The estimated results from the synopsis tuples are merged with the exact results from the original query and presented to the user together.

An optimization problem with approximations is that approximations often require building internal state. In [ ] it is suggested to share not only intermediate results, but also the synopses built by operators. In the formal model it is hard to share synopses because of the structure of views.

Chapter Eight

Conclusions

This chapter will first give an overview of the ‘Monitoring a River’ example that shows how the work presented in this thesis can be useful. Then answers to the research questions are given, followed by the results and contribution. In Section . a suggested approach towards actually building an optimizer for a Streaming Workflow System is presented.

. Monitoring a River: Overview

Throughout the previous chapters the example of a scientist who wants to detect ground water influxes along a river has been used to illustrate the Streaming Workflow System, the formal model and transformation rules.

In Section . the example was introduced and in Section . a streaming workflow was presented that processes flow and temperature measurements. The workflow showed that the formal model can indeed be used to describe the data processing required in the example.

Chapter presented generic transformation rules that can transform streaming workflows based on their abstract structure and (window sequence) characteristics. To illustrate the usefulness of these rules, a more abstract and hypothetical example was introduced in Section . . The example showed that two individual streaming workflows can share computation through application of the right transformation rules in the right order. This abstract example also illustrates the (hypothetical) situation that would exist within the runtime of a Streaming Workflow System.

In Section . the river example was used to show how the Streaming Workflow System can use the transformation rules presented in Chapter to optimize a streaming workflow as it evolves during a research project.

In Chapter transformation rules for tuple-based activities were presented. Although the example did not include those explicitly, they are nonetheless indispensable for streaming workflow processing. The introduction to Chapter describes this.

Aggregate functions and joins are a key component for streaming workflows; in the ‘Monitoring a River’ example aggregates account for the majority of activities. Transformation rules for those activities are described in Chapter .

The ‘Monitoring a River’ example has demonstrated that the formal model and transformation rules presented in this thesis are up to the task of modeling and transforming a streaming workflow.

. Answers to the Research Questions

This section answers the research questions posed in Section .

1. How must the (existing) formal model be adapted to allow for transformation? Chapter has explained the new formal model. The big change with respect to the original model is the predictability of activities. The old model used triggers to determine when an activity is activated. The new model replaces them with window sequences that are fully predictable.

Another change is the definition of deterministic activities. Design Choice IV defines ‘stream determinism’, which is weaker than the determinism in the old model but is sufficient for the transformations presented.

Other additions to the model include the concept of fresh tuples and the definition of a standard window sequence for tuple-based activities.

2. What is a suitable definition of equivalence between workflows? In Chapter equivalence of activities, views and workflows was defined. In the following chapters it was shown that this definition is suitable: the definition captures the purpose of an optimizer and the definition can be used to prove that transformation rules preserve equivalence.

The definition of workflow equivalence was split into strict and weak equivalence to make the asymmetry between theory and practice explicit. The weak equivalence of two workflows is defined with a subset of the views in those workflows to honor the fact that in practice equivalence is not symmetrical and the fact that an equivalent view can be recreated when needed.

3. Can transformation rules be defined for common types of activities? Chapters , and have shown various transformation rules. For many rules a proof was provided and for others an indication was given how a similar proof can be constructed. This shows that the rules are valid and that the definition of equivalence from Chapter is suitable.

Rules were presented for generic situations, for common (relational) activities and for aggregates and joins. It was shown that common activity types can be described and that existing continuous querying research can be described in the formal model.

. Results and Contribution

The result of this master's thesis project is a formal model for streaming workflows and a set of transformation rules valid within that model. The quality and quantity of the rules are such that the primary objective of this research is considered to be met.

Although the expressive power of the model has been reduced in exchange for predictable behavior, it is still more than adequate for modeling streaming workflows. The added limitations and changes do not impair the functionality of the model. In addition, the new formal model can be interpreted as a subset of the original model. Therefore the transformation rules can be applied to any (sub-)workflow in the original model that happens to fit in the subset defined by the new model. Because preserving global equivalence is achieved by preserving local equivalence, only a subgraph of a workflow needs to fit in the new formal model for transformation rules to be applicable.

By adapting the model and defining transformation rules, it has been shown that automatic optimization of streaming workflows within a Streaming Workflow System is very well possible. The next section proposes how to continue towards this goal.

. Suggested Approach to Building an Optimizer

In Section . , optimization was described as searching for an alternative that is better than the original. In the same section it was made clear that choosing the best alternative is outside the scope of this thesis, limiting this thesis to searching for alternatives. The contribution of this thesis is actually smaller than that: this thesis presents transformation rules to create an alternative streaming workflow from a given one. How to navigate the search space of all possible streaming workflows is still open.

Below is a suggested approach to building an actual optimizer for the Streaming Workflow System.

1. First of all, build a state space explorer! Model checkers are built to explore huge state spaces efficiently, so pick a graph-based model checker like GROOVE [ , ] and extend it with support for transformation rules for streaming workflows.

All the graph manipulation functionality is already available, so what needs to be added is support to express and manipulate window sequences; therefore a concise vocabulary for window sequence matching and manipulation will have to be chosen or developed. The language will have to support expressing concepts like alignment (Definitions and ) and to support introducing new window sequences using simple arithmetic as in the Aggregate Split Rule (Figure . ).

Since an optimizer assumes that all transformation rules preserve equivalence, it is not needed at first to track which views are equivalent. But eventually it must be known which view in the system runtime maps to a view in the user-defined workflow. Therefore support is needed to specify in a transformation rule which views are equivalent.

2. Second, find some work for the state space explorer. Compile a set of theoretical (isolated) workflows and transformations, based on the transformation rules presented in this thesis and on the possibilities of the chosen window sequence manipulation language. This set can be used to test the state space explorer and to explore the limitations of the formal model and of the window sequence semantics.

3. Third, find some real work for the state space explorer. Compile a set of workflows and transformation rules that are more or less realistic. Build larger compounded workflows from the isolated workflows, because in the runtime of a Streaming Workflow System, all workflows submitted by the users are treated as a single big workflow with many outputs.

4. Finally, build an optimizer. Pick a metric to optimize streaming workflows for. Compute this metric for all streaming workflows generated by the state space explorer. Evaluate the metric by comparing the original workflow with the workflow selected as the best by the metric.

5. Additionally, improve the optimizer. Use heuristics to constrain the search space and to steer the search towards the optimum for the metric, whilst still being able to find an optimal streaming workflow. This might be challenging to implement as it requires modification of the state space exploration algorithm.

The first and second step can be started based on the results of this thesis. The other three steps can be repeated for multiple application domains.

Summary Create a set of workflows and transformation rules to test with. Adapt an existing graph-based model checker to generate the full search space (or at least up to a maximum search depth). Define and evaluate metrics. Use heuristics to steer the state space exploration.
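The approach summarized above can be sketched in a few lines, under strong simplifying assumptions. A workflow is reduced to a linear tuple of activity names, the single rule `push_select_down` is a hypothetical stand-in for an equivalence-preserving transformation rule, and `cost` is an invented metric in which every activity costs its input volume and a select halves that volume. Nothing here uses the GROOVE API or the window sequence semantics of the formal model.

```python
from collections import deque
from typing import Callable, List, Set, Tuple

Workflow = Tuple[str, ...]
Rule = Callable[[Workflow], List[Workflow]]


def push_select_down(wf: Workflow) -> List[Workflow]:
    """Hypothetical rule: move a select one step closer to the input."""
    alternatives = []
    for i in range(len(wf) - 1):
        if wf[i] != "select" and wf[i + 1] == "select":
            alternatives.append(wf[:i] + ("select", wf[i]) + wf[i + 2:])
    return alternatives


def explore(start: Workflow, rules: List[Rule], max_depth: int) -> Set[Workflow]:
    """Breadth-first generation of all workflows reachable within max_depth steps."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        wf, depth = frontier.popleft()
        if depth == max_depth:
            continue
        for rule in rules:
            for alt in rule(wf):
                if alt not in seen:
                    seen.add(alt)
                    frontier.append((alt, depth + 1))
    return seen


def cost(wf: Workflow, volume: float = 100.0) -> float:
    """Invented metric: each activity costs its input volume; select halves it."""
    total = 0.0
    for activity in wf:
        total += volume
        if activity == "select":
            volume /= 2
    return total


original: Workflow = ("map", "join", "select")
alternatives = explore(original, [push_select_down], max_depth=3)
best = min(alternatives, key=cost)
```

In this toy setting the explorer finds three workflows and the metric selects the one with the select in front. Heuristics as in the last step would replace the exhaustive breadth-first loop with a search that expands promising workflows first.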

. Unanswered Questions

The previous section has presented an approach to actually building an optimizer. This section presents some unanswered questions that future work should answer.

How Can Non-Streaming Data Be Modeled in the Formal Model? The way views are modeled in the current formal model is not suited to handle metadata repositories and other types of semi-static data. These are needed, for example, to record the calibration status of a sensor.

Another type of storage for which views are not very well suited is ‘scratchpad’ data shared between activities.

Should the System Handle Delayed Data? Handling delayed data (out of order) is either trivial or very complex. In certain domains users are accustomed to the fact that sensors fail and die; the user is glad the data arrived at all. In other domains this may not be the case, so the system has to mitigate late-arriving data to a certain extent. Workflows that are not used for real-time monitoring may incorporate calls to external web services, which can be slow or temporarily unavailable. The Borealis [ ] system provides dynamic revision of query results.

. Evaluation

Foremost, this research has been a thought experiment and this thesis reflects that. The result of this research is an abstract but clear starting point for the approach suggested in Section . .

This thesis is light on technical detail and literature references because building an optimizer has never been a goal of this research and neither has this research been intended to be a literature study into stream processing or scientific workflows.

Bibliography

[ ] Groove. http://groove.cs.utwente.nl/.

[ ] Daniel J. Abadi, Yanif Ahmad, Magdalena Balazinska, Mitch Cherniack, Jeong-Hyon Hwang, Wolfgang Lindner, Anurag S. Maskey, Alexander Rasin, Esther Ryvkina, Nesime Tatbul, Ying Xing, and Stan Zdonik. The design of the Borealis stream processing engine. In CIDR, pages – , .

[ ] Arvind Arasu, Shivnath Babu, and Jennifer Widom. An abstract semantics and concrete language for continuous queries over streams and relations.

[ ] Arvind Arasu and Jennifer Widom. Resource sharing in continuous sliding-window aggregates. In VLDB ’ : Proceedings of the Thirtieth international conference on Very large data bases, pages – . VLDB Endowment, .

[ ] Ron Avnur and Joseph M. Hellerstein. Eddies: continuously adaptive query processing. SIGMOD Rec., : – , May .

[ ] Don Carney, Ugur Cetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. Monitoring streams: a new class of data management applications. In Proceedings of the th international conference on Very Large Data Bases, VLDB ’ , pages – . VLDB Endowment, .

[ ] Sirish Chandrasekaran, Owen Cooper, Amol Deshpande, Michael J. Franklin, Joseph M. Hellerstein, Wei Hong, Sailesh Krishnamurthy, Sam Madden, Vijayshankar Raman, Fred Reiss, and Mehul Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proceedings of the CIDR Conference, .

[ ] Jianjun Chen, David J. DeWitt, Feng Tian, and Yuan Wang. NiagaraCQ: a Scalable Continuous Query System for Internet Databases. pages – , .

[ ] Anita Dani and Janusz Getta. Conceptual modelling of computations on data streams, .

[ ] Reinier-Jan de Lange. Provenance aware sensor networks for real-time data analysis. Master’s thesis, University of Twente, .

[ ] Bugra Gedik, Henrique Andrade, Kun-Lung Wu, Philip S. Yu, and Myungcheol Doo. SPADE: the System S declarative stream processing engine. In SIGMOD ’ : Proceedings of the ACM SIGMOD international conference on Management of data, pages – , New York, NY, USA, . ACM.

[ ] J. Gehrke and S. Madden. Query processing in sensor networks. Pervasive Computing, IEEE, ( ): – , .

[ ] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Mining and Knowledge Discovery, ( ): – , .

[ ] Moustafa A. Hammad, Michael J. Franklin, Walid G. Aref, and Ahmed K. Elmagarmid. Scheduling for shared window joins over data streams. In VLDB ’ : Proceedings of the th international conference on Very large data bases, pages – . VLDB Endowment, .

[ ] Sailesh Krishnamurthy, Michael J. Franklin, Joseph M. Hellerstein, and Garrett Jacobson. The case for precision sharing. In VLDB, pages – , .

[ ] Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. No pane, no gain: efficient evaluation of sliding-window aggregates over data streams. SIGMOD Rec., ( ): – , .

[ ] Rajeev Motwani, Jennifer Widom, Arvind Arasu, Brian Babcock, Shivnath Babu, Mayur Datar, Gurmeet Manku, Chris Olston, Justin Rosenstein, and Rohit Varma. Query processing, resource management, and approximation in a data stream management system. Technical Report - , Stanford InfoLab, .

[ ] Vijayshankar Raman, Amol Deshpande, and Joseph M. Hellerstein. Using state modules for adaptive query processing. In ICDE, pages – , .

[ ] F. Reiss and J.M. Hellerstein. Data triage: an adaptive architecture for load shedding in TelegraphCQ. In Data Engineering, . ICDE . Proceedings. st International Conference on, pages – , .

[ ] A. Rensink. The GROOVE simulator: A tool for state space generation. In J. L. Pfaltz, M. Nagl, and B. Bohlen, editors, Applications of Graph Transformations with Industrial Relevance (AGTIVE), volume of Lecture Notes in Computer Science, pages – , Berlin, . Springer Verlag.

[ ] Utkarsh Srivastava, Kamesh Munagala, and Jennifer Widom. Operator placement for in-network stream query processing. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, PODS ’ , pages – , New York, NY, USA, . ACM.

[ ] A. Wombacher. Composable data processing in environmental science - a process view. In Proceedings of IEEE International Conference on Services Computing (SCC ), Honolulu, Hawaii, USA, volume , pages – , Los Alamitos, July . IEEE Computer Society Press.

[ ] Andreas Wombacher. Data Workflow - A Unified Time Controlled E-Science Processing Model. .

[ ] Yongluan Zhou, Karl Aberer, Ali Salehi, and Kian-Lee Tan. Rethinking the design of distributed stream processing systems.
