A Code Generation Approach to Optimizing High-Performance Distributed Data Stream Processing

Buğra Gedik
Thomas J. Watson Research Center, IBM Research
Hawthorne, NY, 10532
[email protected]

Henrique Andrade
Thomas J. Watson Research Center, IBM Research
Hawthorne, NY, 10532
[email protected]

Kun-Lung Wu
Thomas J. Watson Research Center, IBM Research
Hawthorne, NY, 10532
[email protected]

Abstract

We present a code-generation-based optimization approach to bringing performance and scalability to distributed stream processing applications. We express stream processing applications using an operator-based, stream-centric language called SPADE, which supports composing distributed data flow graphs out of toolkits of type-generic operators. A major challenge in building such applications is to find an effective and flexible way of mapping the logical graph of operators into a physical one that can be deployed on a set of distributed nodes. This involves finding how best operators map to processes and how best processes map to computing nodes. In this paper, we take a two-stage optimization approach, where an instrumented version of the application is first generated by the SPADE compiler to profile and collect statistics about the processing and communication characteristics of the operators within the application. In the second stage, the profiling information is fed to an optimizer to come up with a physical data flow graph that is deployable across nodes in a computing cluster. This approach not only creates highly optimized applications that are tailored to the underlying computing and networking infrastructure, but also makes it possible to re-target the application to a different hardware setup by simply repeating the optimization step and re-compiling the application to match the physical flow graph produced by the optimizer. Using real-world applications, from diverse domains such as finance and radio-astronomy, we demonstrate the effectiveness of our approach on System S – a large-scale, distributed stream processing platform.

Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—Query processing

General Terms
Algorithms, Design, Performance

Keywords
Streaming Systems, Profile Driven Optimization

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
CIKM'09, November 2–6, 2009, Hong Kong, China.
Copyright 2009 ACM 978-1-60558-512-3/09/11 ...$10.00.

1. INTRODUCTION

In an increasingly information-centric world, people and organizations rely on time-critical tasks that require accessing data from highly dynamic information sources and generating responses derived from on-line processing of data in near real-time. In many application domains, these information sources take the form of data streams, which are time-ordered series of events or sensor readings. Examples include live stock and option trading feeds in financial services, physical link statistics in networking and telecommunications, sensor readings in environmental monitoring and emergency response, and satellite and live experimental data in scientific computing.

The proliferation of high-rate streaming sources, coupled with the stringent response time requirements of stream processing applications, forces a paradigm shift in how we process streaming data, moving away from the "store and then process" model of database management systems toward the "on-the-fly processing" model of emerging stream processing systems (SPSs). This paradigm shift has created strong interest in SPS-related research, in academia [21, 8, 4, 1, 6, 11] and industry [9, 14, 20, 2, 10] alike.

Due to the large and growing number of users, jobs, and information sources, as well as the high aggregate rate of data streams distributed across remote sources, performance and scalability become key challenges in SPSs. In this paper, we describe a code generation approach to optimizing high-performance distributed data stream processing applications, in the context of the SPADE language and compiler [10] and the System S distributed stream processing platform [2].

1.1 Data Flow Graphs in System S

System S applications are composed of jobs that take the form of data flow graphs. A data flow graph is a set of operators connected to each other via streams. Each operator can have zero or more input ports, as well as zero or more output ports. An operator that lacks input ports is called a source, and similarly an operator that lacks output ports is called a sink. Each output port creates a stream, which carries tuples flowing toward the input ports that are subscribed to the output port. An output port can publish to multiple input ports, and dually an input port can subscribe to multiple output ports, all subject to type compatibility constraints. Data flow graphs are allowed to contain feedback loops, which may form cycles in the graph.

Data flow graphs can be deployed across the compute nodes of a System S cluster. The placement, scheduling, and other resource allocation decisions with respect to data flow graphs are handled autonomically by the System S runtime, although they can also be influenced by the users through knobs exposed by System S.

[Figure 1: A sample data flow graph in System S, illustrating a source, a sink, a multi-input port operator, a multi-output port operator, multiple subscribers on an output port, multiple publishers on an input port, a chain-like segment, a bushy segment with forks and joins, symmetry and repetition (a common pattern in distributed applications), and one possible placement on four compute nodes.]

Figure 1 illustrates several concepts related to distributed stream processing and data flow graphs in the context of System S. In this paper, we focus on compile-time optimizations that deal with the creation of efficient data flow graphs, which falls under the responsibility of the SPADE compiler, as discussed next. Optimizations that relate to the System S stream processing core [2], the multi-job optimizing scheduler [25], and the job management subsystem [13] are not discussed here.

1.2 Application Composition with SPADE

SPADE is a rapid application development front-end for System S. It consists of a language, a compiler, and auxiliary support for building distributed stream processing applications to be deployed on System S. It provides three key functionalities:

− A language to compose parameterizable distributed stream processing applications, in the form of data flow graphs. The SPADE language provides a stream-centric, operator-level programming model. The operator logic can optionally be implemented in a lower-level language, like C++, whereas the SPADE language is used to compose these operators into logical data flow graphs. The SPADE compiler is able to coalesce logical data flow graphs into physical ones that are more appropriate for deployment on a given hardware configuration, from a performance point of view. This is achieved by fusing several operators and creating bigger ones that "fit" in the available compute nodes. As we will discuss shortly, how this coalescing is performed is critical in creating high-performance data stream processing applications.

− A type-generic streaming operator model, which captures the fundamental concepts associated with streaming applications, such as windows on input streams, aggregation functions on windows, output attribute assignments, and punctuation markers in streams. SPADE comes with a default toolkit of operators, called the relational toolkit, which provides common operators that implement relational algebra-like operations in a streaming context.

− Support for extending the language with new type-generic, configurable, and reusable operators. This enables third parties to create application- or domain-specific toolkits of reusable operators.

For details of the SPADE language, we refer the reader to our SPADE project overview paper [10].

1.3 Challenges in Creating Optimized Stream Processing Applications

A major challenge in building high-performance distributed stream processing applications is to find the right level of granularity in mapping operators to processes to be deployed on a set of distributed compute nodes. From now on, we refer to the container processes that host operators as Processing Elements, PEs for short. The challenge of creating PE-level flow graphs for deployment, out of user-specified operator-level flow graphs, has two aspects, namely flexibility and performance.

From a flexibility point of view, one should define an application using fine-granular operators, so that it can be flexibly deployed on different hardware configurations. Monolithic operators that are "too big" make it very hard to port an application from a small set of powerful nodes (like a Blade server) to a large set of less powerful nodes (like a BlueGene supercomputer), as doing so requires manually refactoring the user code, which is a significant burden on the developer. SPADE's operator-level programming language is focused on designing the application by reasoning about the smallest possible building blocks that are necessary to deliver the computation an application is supposed to perform. While it is hard to precisely define what an operator is, in most application domains, application engineers typically have a good understanding of the collection of operators they intend to use. For example, database engineers typically conceive their applications in terms of the operators provided by the stream relational algebra [5]. Likewise, MATLAB [18] programmers have several toolkits at their disposal, from numerical optimization to signal processing.

From a performance point of view, the granularity of the PE-level graph sets the trade-off between making use of pipelined parallelism and avoiding costly inter-process communication. For instance, at one extreme, it is undesirable from a performance standpoint to run 100 operators as 100 PEs on a single processor system, due to the excessive cost of tuple transfers between processes. At the other extreme, running 100 operators as a single PE on a multi-node configuration makes it impossible to take advantage of hardware parallelism that might be available (e.g., from multi-core chips or multiple nodes). The sweet spot for optimization is really the set of scenarios between these two extremes, where a healthy balance of fusion and pipelined parallelism is ideal for achieving high performance. As an application gets larger, finding such a balance manually becomes increasingly difficult and necessitates automatic creation of PE-level data flow graphs. SPADE's fusion optimizer can automatically come up with an optimized partitioning of operators into PEs, in order to maximize the performance for a given hardware configuration.

Another major challenge in creating optimized stream processing applications is to understand the computation and communication characteristics of the operators that form a data flow graph. Without such information, it is almost impossible to make informed decisions in the optimization stage. The SPADE compiler addresses this problem by generating instrumentation code under profiling mode, so that resource usage statistics of operators can be captured and fed into the optimizer. Further complicating the problem, such statistical information could potentially be highly skewed when the profiling is performed in non-fused mode. To address such concerns, SPADE's profiling framework is designed to be flexible enough to collect operator-level statistics under arbitrary fusion settings and with little runtime overhead.

1.4 Contributions

In this paper we devise an optimization framework aimed at finding an effective and flexible way of mapping the logical graph of operators into a physical one that can be deployed on a set of distributed nodes. Our contributions include:

• A two-staged optimization framework, where an instrumented version of the application is first generated by the SPADE compiler to profile and collect statistics about the processing and communication characteristics of the operators within the application. In the second stage, the profiling information is fed to an optimizer to come up with a physical data flow graph that is deployable across nodes in a computing cluster.

• A code generation scheme to create containers that fuse operators such that stream connections between the operators within the same container are reduced to function calls, making them highly efficient compared to streams that cross container boundaries.

• A profiling scheme to learn the computation and communication characteristics of the operators in a data flow graph. In profiling mode, SPADE's compiler emits instrumentation code that not only collects these statistics with little runtime overhead, but also achieves that in a fusion-transparent manner. That is, the same level of information can be collected irrespective of how the application is fused.

• An optimizer to come up with an effective partitioning of the operators into PEs. We propose a greedy optimizer whose primary heuristic is to fuse operators until the overhead of tuple transfers is an acceptable fraction of the total cost of executing all of the operator logic within the PE.

In summary, the reliance on code generation provides the means for not only creating highly optimized applications that are tailored to the underlying computing and networking infrastructure, but also re-targeting the application to a different hardware configuration by simply repeating the optimization step and re-compiling the application to match the physical flow graph produced by the optimizer. In most cases, applications created with SPADE are long-running, continuous queries. Hence the long runtimes amortize the build costs. Nevertheless, the SPADE compiler has numerous features to support incremental compilation, reducing the build costs as well.

Using real-world applications from diverse domains such as finance and radio-astronomy, our experimental results demonstrate the effectiveness of our approach on System S.

Finally, it is important to note that, in this paper, the only data flow graph manipulation we consider is coalescing. We do not explore the possible re-orderings of the operators in the data flow graph. Although such optimizations are commonly applied to query trees in relational databases [22], it is not straightforward to extend them to System S, since in the general case no assumptions can be made about the semantics of the operators that form a data flow graph. Data flow graphs may contain large numbers of user-defined operators or type-generic operators from third-party toolkits.

2. OPERATOR FUSION

SPADE uses code generation to fuse one or more operators into a PE (Processing Element). PEs are units of node-level deployment in System S. A PE is a container for operators and runs as a separate process. At any given time, a running PE is tied to a compute node, although it can be relocated at a later time. Multiple PEs can be run on a given compute node, and a single multi-threaded PE can take advantage of different cores on the same compute node. In this section, we look into how PEs are generated and how they operate.

2.1 Container Generation

SPADE's PE generator produces container code that performs the following fundamental functions:

1) Popping tuples from the PE input queues and sending them to the operators within;

2) Receiving tuples from operators within and pushing them into the PE output queues;

3) Fusing the output ports of operators with the input ports of the downstream operators using function calls, as sketched below.
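To make this concrete, below is a minimal C++ sketch of the shape of a generated container. All names here (GeneratedPE, PEInputQueue, FusedConnection, and so on) are illustrative assumptions rather than the actual SPADE-generated code; the point is that the main loop pops tuples from PE input queues and drives the fused operator subgraph purely through function calls.

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-ins for generated artifacts; names are assumptions.
struct Tuple { /* marshalled attribute data */ };

struct Operator {
    // Invoked once per tuple arriving at input port `port`.
    virtual void process(Tuple& t, std::size_t port) = 0;
    virtual ~Operator() = default;
};

struct PEInputQueue {
    virtual bool pop(Tuple& t) = 0;  // dequeue one tuple, if available
    virtual ~PEInputQueue() = default;
};

// A fused connection: a submit on an upstream output port is reduced
// to a direct process() call on the downstream operator's input port.
struct FusedConnection {
    Operator* downstream;
    std::size_t inputPort;
};

struct GeneratedPE {
    std::vector<PEInputQueue*> inputQueues;      // one per PE input port
    std::vector<FusedConnection> inputBindings;  // operator input it feeds

    // Main PE thread: pop tuples from the PE input queues and hand them
    // to the operators within; intra-PE hops involve no queuing at all.
    void run() {
        Tuple t;
        for (;;) {
            for (std::size_t i = 0; i < inputQueues.size(); ++i) {
                if (inputQueues[i]->pop(t))
                    inputBindings[i].downstream->process(t, inputBindings[i].inputPort);
            }
        }
    }
};
```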

PEs have input and output ports, just like operators. There is a one-to-one mapping between the PE-level ports and the exposed ports of the operators contained inside the PE. An operator-level output port is exposed if and only if at least one of the following conditions is satisfied: i) the output port publishes into an input port that is part of an operator outside this PE; ii) the output port generates an exported stream. Streams can be exported by the application developer to enable other applications to dynamically tap into these streams at runtime. This is especially useful for incremental application deployment, where multiple applications can be brought up at different times, yet can communicate with each other using dynamic stream connections.

Dually, an input port is exposed if and only if at least one of the following conditions is satisfied: i) the input port subscribes to an output port that is part of an operator outside this PE; ii) the input port subscribes to an imported stream. Imported streams have associated import specifications that describe the exported streams they are tapping into. An input port connected to an imported stream needs to be exposed at the PE level, so that the dynamic stream connections can be established during runtime. Ports that are not exposed at the PE level are internal to the PE and are invisible to other PEs or applications.

Unlike operator ports, PE ports are attached to queues. This also points out an important difference between PE-level and operator-level connections. Crossing a connection from a PE output port to a PE input port involves queuing/dequeuing, marshalling/unmarshalling, as well as inter-process communication. The latter can involve going through the network, in case the PEs are located on different nodes. In contrast, connections between operator ports are implemented via function calls and thus are much cheaper than connections between PE ports.

Note that the fusion of operators with function calls results in a depth-first traversal of the operator subgraph that corresponds to the partition of operators associated with the PE, with no queuing involved in-between. However, the existence of multiple threads within a PE can complicate this otherwise simple execution model. Next, we explore the details of the execution of operators within a PE.

2.2 Threading and Execution Model

An operator container generated by SPADE is driven by a main PE thread. This thread checks all input queues for tuples, and when a tuple is available from a PE input port, it fetches the tuple and makes a call to the process function of the associated input port of the operator that will process the tuple. Depending on the details of the operator logic, the depth-first traversal can be cut short in certain branches (e.g., an operator filtering the tuple or buffering it without generating output tuples) or result in multiple sub-traversals (e.g., an operator generating multiple output tuples). This design entails non-blocking behavior in the process functions associated with the input ports of an operator.

As noted earlier, SPADE supports multi-threaded operators, in which case the depth-first traversal performed by the main PE thread will be cut short in certain branches and, more importantly, other threads will continue from those branches independently. This has an important implication: the process functions associated with the input ports of an operator can be executed concurrently, and as a result the code for stateful operators should be written in a thread-safe manner. For user-defined operators, SPADE can generate code to automatically protect the process functions to provide thread-safety (as an optimization, such locks are not inserted into the code if an operator is not grouped together with other operators and is part of a singleton PE). As an alternative, finer-grained locking mechanisms can be employed by the operator developers.
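As an illustration of the coarse-grained protection mentioned above, the following C++ sketch wraps an operator's process function with a single mutex; the class names are assumptions for illustration, and finer-grained schemes would lock only the operator's shared state.

```cpp
#include <cstddef>
#include <mutex>

struct Tuple { /* ... */ };

// Hypothetical user-defined operator whose state is shared across ports.
struct UserOperator {
    void process(Tuple& t, std::size_t port);  // user logic, not thread-safe
};

// Generated wrapper: serializes concurrent process() calls so that a
// stateful operator fused into a multi-threaded PE remains thread-safe.
// When the operator runs alone in a singleton PE, the generator can
// omit this lock entirely.
struct ProtectedOperator {
    UserOperator op;
    std::mutex guard;

    void process(Tuple& t, std::size_t port) {
        std::lock_guard<std::mutex> lock(guard);
        op.process(t, port);
    }
};
```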

PEs can contain any type of operator, including source operators. A source operator is special in the sense that its processing is not triggered by the arrival of an input tuple. Source operators have their own driver process functions. As a result, SPADE-generated PE containers start a separate thread for each source operator within the PE. A source operator thread calls the source operator's process function, which drives the processing chain rooted at the source operator. The source operator does not release control back to the calling context until termination.

Since SPADE supports feedback loops in the data flow graph, an operator graph is not necessarily cycle-free, which opens up the possibility of infinite looping on a cycle within a composite PE (which would result in running out of stack space). The rationale behind allowing feedback loops in SPADE is to enable user-defined operators to tune their logic based on feedback from downstream operators (e.g., by refining a classification model). Under operator fusion, SPADE expects feedback links to be connected to non-tuple-generating inputs, which guarantees cycle-free execution. It is, however, the developer's responsibility to ensure that the feedback links are not connected to tuple-generating inputs.

[Figure 2: A sample SPADE container with fusion, showing the input and output queues, the main PE thread (which drives all the inputs), a source operator (always driven by a separate thread), a multi-threaded operator, a sink operator, and a feedback loop.]

Figure 2 depicts a PE generated by SPADE via fusion of 6 operators. The threads involved in the execution of the PE logic are illustrated with circling yellow arrows. The main PE thread is shown next to the input queues. Source operator O5 has its own thread. O3 is a multi-threaded operator, and its first output port is driven by its additional thread. Note that the operators O1, O2, O4, and O6 are subject to execution under more than one thread. This particular PE has 3 threads in total.

3. PROFILING FRAMEWORK

A powerful profiling framework is a key ingredient of an effective optimization approach. As a result, SPADE's compiler-based optimization is driven by automatic profiling and learning of application characteristics.

SPADE's profiling framework consists of three main components, namely code instrumentation, statistics collection, and statistics refinement. Code instrumentation is used at compile-time to inject profiling instructions into the generated SPADE processing elements. Statistics collection is the runtime process of executing those instructions to collect raw statistics regarding the communication and computation characteristics of operators. Statistics refinement involves post-processing these raw statistics to create refined statistics that are suitable for consumption by the fusion optimizer.

SPADE's profiling framework is designed around the following three goals, which are crucial in achieving effectiveness, flexibility, and efficiency:

− Statistics should be collected at the operator/port level

− The profiling infrastructure should be fusion transparent

− The profiled applications should not suffer from the observer effect

Collecting statistics at the operator/port level enables the fusion optimizer to reason about different fusion strategies, since PE-level statistics can be computed by composing operator-level ones. Fusion transparency provides us with the flexibility to collect the same type of statistics regardless of the fusion mode. This also makes it possible to perform multiple profile/optimize/re-fuse steps to further revise the fusion strategy. However, in this paper we mostly focus on results from a single iteration of this step. Finally, the profiling performed by SPADE must be light-weight, so as not to change the application behavior under instrumentation.

3.1 Code Instrumentation

SPADE instruments the generated containers using submit signals. For each operator-level output port in the container, there is one submit signal, which keeps a list of all the calls to the process functions of the downstream operator inputs subscribed to the output port. There is also a submit signal for each PE-level input port. Submit signals are used as instrumentation points, and are used to collect the following port-level metrics:

1) Number of tuples seen by an input/output port

2) Size of tuples seen by an input/output port

3) CPU time taken to submit a tuple via an output port

4) CPU time taken to process a tuple received by an input port

[Figure 3: Instrumented PE containers. For a submit signal on O1's output port that feeds both O2 and O3, the instrumentation proceeds as follows: start O1's submit0 timer; start O2's process0 timer; call O2's process0; stop O2's process0 timer; start O3's process0 timer; call O3's process0; stop O3's process0 timer; stop O1's submit0 timer.]

In order to measure the CPU time it takes to perform a submit call, SPADE starts a timer when it receives a tuple at the submit signal and stops the timer when all the calls to the process functions are completed. It also starts similar timers for each process function call attached to the submit call. For instance, in Figure 3 the submit signal contains two process calls and as a result employs three timers: one for the submit call as a whole, and one for each of the process calls. These timers are implemented using CPU counters. As we will discuss shortly, SPADE instrumentation does not collect the aforementioned metrics on a per-tuple basis, and uses sampling to decrease the cost of profiling.
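The following C++ sketch illustrates the timer nesting of Figure 3 for a submit signal with two downstream process calls. The classes are assumptions for illustration; in particular, a real implementation would read hardware CPU counters rather than the wall-clock timer used here.

```cpp
#include <chrono>
#include <cstddef>
#include <vector>

struct Tuple { /* ... */ };
struct Operator {
    virtual void process(Tuple&, std::size_t) = 0;
    virtual ~Operator() = default;
};

// Illustrative timer; SPADE uses CPU counters instead of wall-clock time.
struct CpuTimer {
    std::chrono::steady_clock::time_point begin;
    void start() { begin = std::chrono::steady_clock::now(); }
    double stopSeconds() {
        return std::chrono::duration<double>(
            std::chrono::steady_clock::now() - begin).count();
    }
};

struct ProcessCall { Operator* op; std::size_t inputPort; };

// One submit signal per operator-level output port: it knows every
// downstream process call subscribed to the port, times each call,
// and times the submit call as a whole.
struct SubmitSignal {
    std::vector<ProcessCall> calls;
    std::size_t tupleCount = 0;
    std::size_t stride = 1000;  // sampling fraction s = 0.001

    void submit(Tuple& t) {
        bool sample = (++tupleCount % stride == 0);  // stride sampling
        CpuTimer submitTimer;
        if (sample) submitTimer.start();
        for (ProcessCall& c : calls) {
            CpuTimer processTimer;
            if (sample) processTimer.start();
            c.op->process(t, c.inputPort);
            if (sample) record(processTimer.stopSeconds());  // per-call time
        }
        if (sample) record(submitTimer.stopSeconds());       // whole submit
    }

    void record(double seconds) { /* push into the reservoir (Section 3.2) */ }
};
```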

SPADE also maintains a separate base CPU time metric for each thread in the container, other than the main PE thread.

3.2 Statistics Collection

For each metric collected at a submit signal, SPADE maintains not one but a number of samples. For instance, we will have several values representing the CPU time taken to submit on an output port, each for a different tuple. These values are timestamped to identify them uniquely. As a result, statistical summaries can be computed from these metrics.

SPADE uses two methods to reduce the computation and memory overhead of profiling, namely stride sampling and reservoir sampling. Stride sampling is used to decide for which tuples we want to collect metrics. A sampling fraction of $s \in (0, 1]$ results in collecting metrics once in every $\lceil 1/s \rceil$ tuples. For these tuples we will collect metrics, but since we cannot keep all metrics in memory in a streaming system, we use a reservoir sampling approach to decide which ones to keep. With reservoir sampling, each metric value collected has an equal chance of ending up in the reservoir. A reservoir size of $S$ is used to limit the memory overhead. The reservoir sampling algorithm works as follows [24]:

Given a new value, which is the $i$th one so far ($i \geq 0$), perform the following action: if $i < S$, add the new value into the reservoir; else ($i \geq S$), with probability $S/i$ replace a random item from the reservoir with the new value (drop it with probability $1 - S/i$).

Note that stride sampling reduces the computation overhead, whereas reservoir sampling reduces the memory overhead of profiling. For instance, with s = 0.001 and S = 500, we will collect metrics once in every 1000 tuples and we will only keep a random subset of 500 values from the metric values collected so far.

[Figure 4: A sample execution flow, showing the process time, the submit time, and their difference (process time – submit time).]
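For concreteness, here is a minimal C++ sketch of the reservoir step described above [24]: each offered value ends up in the reservoir of size S with equal probability. The class name and interface are assumptions.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Keeps a uniform random sample of at most S values from an unbounded
// stream of metric values.
class Reservoir {
    std::vector<double> values;
    std::size_t capacity;  // S
    std::size_t seen = 0;  // i: number of values offered so far
    std::mt19937 rng{std::random_device{}()};

public:
    explicit Reservoir(std::size_t S) : capacity(S) { values.reserve(S); }

    void offer(double v) {
        if (seen < capacity) {
            values.push_back(v);  // i < S: always keep the new value
        } else {
            // i >= S: keep with probability S/i by replacing a random slot.
            std::uniform_int_distribution<std::size_t> pick(0, seen);
            std::size_t j = pick(rng);
            if (j < capacity) values[j] = v;
        }
        ++seen;
    }

    const std::vector<double>& sample() const { return values; }
};
```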

3.3 Statistics Refinement

Recall that one of the goals of profiling was to collect metrics at the operator/port level. Fusion makes this task challenging, since the CPU time for processing a tuple on a given input port not only counts the amount of time spent executing the associated operator's logic, but also any downstream processing that may take place. For instance, Figure 4 shows one possible execution flow and the associated submit and process times. A simple approach for extracting the time spent executing the operator logic is to subtract the time spent on downstream processing on a given output port, called the submit time, from the total time measured for the process call at the input port, called the process time. However, this is not possible without very heavy instrumentation, since the execution flow following a process call on an input port is operator logic dependent. For instance, the tuple processing may result in zero or more submit calls on one or more output ports. Without tracking which submit calls are triggered by which process calls, it is impossible to compute the CPU time spent within a given operator's scope on a per-port basis. Multi-threading further complicates the situation.

Fortunately, one can compute the average amount of CPU time spent by an operator by refining the raw metric values described in Section 3.1. Concretely, we perform a post-processing step after raw metric values are collected during a profiling run to create the following refined statistics:

Data transfer rate: For each input (output) port, we compute the rate in bytes/sec, denoted by $r^{in}_i$ ($r^{out}_i$) for the $i$th input (output) port. Similarly, we compute the rate in tuples/sec, denoted by $c^{in}_i$ ($c^{out}_i$) for the $i$th input (output) port.

CPU fraction: For each operator, we compute the fraction of the CPU it utilizes, denoted by $u \in [0, N]$, where $N$ is the number of CPUs on a node. The CPU fraction is computed by aggregating the process and submit time metric values. Before giving the details of its computation, we first introduce some notation.

Let us denote the CPU time spent processing a tuple on the $i$th input port by $t^{in}_i$ and, similarly, the CPU time spent submitting a tuple on the $i$th output port by $t^{out}_i$. Let $k$ denote the number of input ports and $l$ the number of output ports. Recalling that an operator can have additional threads, $m$ of them in the general case, we denote the CPU fraction taken up by the $i$th thread by $b_i$, which can be trivially computed using the base CPU time metric from Section 3.1 and the associated timestamps.

Finally, we compute the CPU fraction $u$ for the operator at hand as follows:

$$u = \sum_{i=1}^{m} b_i + \sum_{i=1}^{k} c^{in}_i \cdot t^{in}_i - \sum_{i=1}^{l} c^{out}_i \cdot t^{out}_i \qquad (1)$$

Equation 1 is interpreted as follows. First, we add up the fraction of the CPU used by any threads that the operator might have. Then, for each input port, we add the fraction of the CPU spent executing the associated process calls. Finally, for each output port, we subtract the fraction of the CPU spent executing the associated submit calls. The fraction added for the $i$th input port is approximated by $c^{in}_i \cdot t^{in}_i$: $c^{in}_i$ is the number of tuples processed per second, whereas $t^{in}_i$ is the average CPU time, in seconds, spent executing a process call. This average is computed using the metric values that were stored in the reservoir during the profiling run. The fraction of the CPU spent executing submit calls is computed similarly ($c^{out}_i \cdot t^{out}_i$ for the $i$th output port).
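A short C++ sketch of this refinement step, computing Equation 1 from the refined per-port statistics; the struct layout and field names are assumptions.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Refined per-operator statistics (Section 3.3); field names assumed.
struct OperatorStats {
    std::vector<double> threadCpuFraction;  // b_i, one per extra thread
    std::vector<double> inTuplesPerSec;     // c^in_i, per input port
    std::vector<double> inCpuSecPerTuple;   // t^in_i, avg process time
    std::vector<double> outTuplesPerSec;    // c^out_i, per output port
    std::vector<double> outCpuSecPerTuple;  // t^out_i, avg submit time
};

// Equation 1: u = sum b_i + sum c^in_i * t^in_i - sum c^out_i * t^out_i
double cpuFraction(const OperatorStats& s) {
    double u = std::accumulate(s.threadCpuFraction.begin(),
                               s.threadCpuFraction.end(), 0.0);
    for (std::size_t i = 0; i < s.inTuplesPerSec.size(); ++i)
        u += s.inTuplesPerSec[i] * s.inCpuSecPerTuple[i];    // process cost
    for (std::size_t i = 0; i < s.outTuplesPerSec.size(); ++i)
        u -= s.outTuplesPerSec[i] * s.outCpuSecPerTuple[i];  // downstream cost
    return u;
}
```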

3.4 Container-level Statistics

So far we have discussed operator/port-level statistics. However, the fusion optimizer also needs to know the cost of sending and receiving tuples at the PE boundaries. Consider the simple scenario depicted in Figure 5, where we have two operators, O1 and O2, connected via a stream, and we consider two alternative fusion strategies: two separate PEs versus a single composite PE. For brevity, assume that these operators have a selectivity of 1 and do not change the size of the tuple as they propagate it. In the first alternative, the cost of processing a tuple is equal to the cost of receiving a tuple, executing the operator logic, and sending the tuple, that is, $C_r + C(O_1) + C_s$ for operator O1 and $C_r + C(O_2) + C_s$ for operator O2. However, when we sum up these costs we overestimate the cost of the second alternative, since the actual cost for the latter is $C_r + C(O_1) + C(O_2) + C_s$, assuming that the cost of a function call is negligible compared to $C_r$, $C_s$, $C(O_1)$, and $C(O_2)$.

[Figure 5: Fusion is not cost-wise additive.]

In summary, the fusion optimizer needs to know the processing cost involved in sending and receiving tuples in order to reason about the cost of different fusion strategies. It is important to note that the cost of sending and receiving tuples mostly depends on the rate at which the tuples are being sent/received and on their sizes. As a result, SPADE maintains an application-independent mapping of ⟨rate (tuples/sec), tuple size (bytes)⟩ pairs to CPU fractions ($v : \mathbb{R}^+ \times \mathbb{N}^+ \to [0, N]$), which is used for all applications. This mapping needs to be re-adjusted only when the hardware changes; it can then be scaled using the new processor's MIPS value divided by the MIPS value of the processor used to create the mapping.
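One plausible representation of the mapping v is a table of calibration points with a nearest-neighbor lookup, sketched below in C++; both the representation and the log-space distance are assumptions (an implementation could just as well interpolate).

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// One calibration point of the application-independent mapping
// v : (rate in tuples/sec, tuple size in bytes) -> CPU fraction.
struct CommCostEntry {
    double tuplesPerSec;
    double tupleBytes;
    double cpuFraction;
};

// Nearest-neighbor lookup over the calibration points, compared in log
// space since rates and sizes span several orders of magnitude.
double commCpuFraction(const std::vector<CommCostEntry>& table,
                       double rate, double bytes) {
    double bestDist = std::numeric_limits<double>::max();
    double result = 0.0;
    for (const CommCostEntry& e : table) {
        double d = std::pow(std::log(e.tuplesPerSec / rate), 2) +
                   std::pow(std::log(e.tupleBytes / bytes), 2);
        if (d < bestDist) { bestDist = d; result = e.cpuFraction; }
    }
    return result;
}
```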

4. THE FUSION OPTIMIZER

The goal of fusion optimization is to come up with a PE-level data flow graph using the statistics collected as part of the profiling step about the communication and computation characteristics of the operators, as well as the application-independent statistics regarding the cost of sending and receiving tuples at the PE boundaries. Deployment of the resulting PE-level data flow graph should provide better throughput compared to the naïve approaches of creating one PE per operator or fusing all operators into one PE, and, more importantly, compared to manual fusion done by application designers (which is only practical for small-scale applications).

4.1 Problem Formulation

Let $\mathcal{O} = \{O_1, \cdots, O_n\}$ denote the set of operators in the data flow graph. Our goal is to create a partitioning, that is, a set of partitions $\mathcal{P} = \{P_1, \cdots, P_m\}$ where each partition is a set of operators ($P_i \subset \mathcal{O}$, $\forall P_i \in \mathcal{P}$), such that this partitioning is non-overlapping ($\forall_{i \neq j \in [1..m]}$, $P_i \cap P_j = \emptyset$) and covering ($\cup_{i \in [1..m]} P_i = \mathcal{O}$). Each partition represents a container PE to be generated by the SPADE compiler, as described in Section 2. There are two main constraints in creating the partitioning $\mathcal{P}$.

First, the total CPU fraction used by a partition should not exceed a system-specified threshold, say $MaxFrac$. Let us denote the computational load of a partition $P_i$ by $CompLoad(P_i)$, where:

$$CompLoad(P_i) = OperLoad(P_i) + CommLoad(P_i).$$

$OperLoad$ represents the computational load due to executing the operators within a single PE, that is:

$$OperLoad(P_i) = \sum_{O_j \in P_i} u(O_j).$$

$u(O_j)$ is the CPU fraction used by operator $O_j$, as described in Section 3.3. $CommLoad$ represents the communication load incurred due to sending and receiving tuples at the PE boundaries, which is computed using the rates, the tuple sizes, and the container-level statistics described in Section 3.4. Let $Rate(P_i)$ be the inter-PE communication rate for partition $P_i$ and $Size(P_i)$ be the average tuple size. Using the mapping $v$ from Section 3.4, we can compute:

$$CommLoad(P_i) = v(Rate(P_i), Size(P_i)).$$

We refer to a partition as saturated iff its computational load is above $MaxFrac$, that is:

$$Saturated(P_i) \equiv CompLoad(P_i) > MaxFrac.$$

With these definitions, we can represent the first constraint as:

$$\text{(saturation constraint)} \quad \forall_{P_i \in \mathcal{P}}, \neg Saturated(P_i).$$

We usually set $MaxFrac$ to a value smaller than 1 in order to leave enough slack for the System S scheduler to dynamically adjust PE placements and CPU allocations during runtime, in response to changes in the workload.

Second, the ratio of the CPU load due to executing the operator logic within a partition to the overall CPU load for the partition, referred to as the effective utilization and denoted by $EffectiveUtil$, should be greater than or equal to a threshold, say $MinUtil$. This limits the overhead of inter-PE communication. For instance, if a partition contains a single operator that performs very little work on a per-tuple basis, the time spent by the PE container for receiving and sending tuples will constitute a significant portion of the overall CPU load, resulting in a small $EffectiveUtil$ value, which is undesirable. Formally:

$$EffectiveUtil(P_i) = \frac{OperLoad(P_i)}{OperLoad(P_i) + CommLoad(P_i)}.$$

We refer to a partition as underutilized if its effective utilization is below $MinUtil$, that is:

$$Underutilized(P_i) \equiv EffectiveUtil(P_i) < MinUtil.$$


Algorithm 1: GreedyFuse Algorithm

GREEDYFUSE(O)
(1) P ← {P_i : P_i = {O_i} ∧ O_i ∈ O}
(2) while true
(3)     P_c ← {{P_i, P_j} ⊂ P : Mergable(P_i, P_j)}
(4)     if P_c = ∅ then break
(5)     {P_i, P_j} ← argmax_{{P_i, P_j} ∈ P_c} MergeBenefit(P_i, P_j)
(6)     P_i ← P_i ∪ P_j ; P ← P − {P_j}
(7) Label partitions in P as P_1, · · · , P_m

Ideally, we should have no underutilized partitions. Formally:

$$\text{(utilization constraint)} \quad \forall_{P_i \in \mathcal{P}}, \neg Underutilized(P_i).$$

Finally, among the solutions that satisfy the saturation and utilization constraints, we prefer the one that minimizes inter-PE communication. In other words, the optimization goal is to minimize the objective function $\sum_{P_i \in \mathcal{P}} Rate(P_i)$.

4.2 GreedyFuse Algorithm

SPADE employs an algorithm named GreedyFuse to create operator partitions. This greedy algorithm starts with a partitioning where each operator is assigned to a different partition. At each greedy step, we create a set of candidate merges, where a merge involves fusing two of the existing partitions into a new, bigger one. Each candidate merge is assigned a merge benefit, and the one with the highest benefit is applied to complete the greedy step. The algorithm continues until no candidate merges are available.

In order to create the candidates for merging, the SPADE fusion optimizer considers all pairs of underutilized partitions, but filters out the pairs that are not connected to each other or would violate the saturation constraint when merged. Formally:

$$Mergable(P_i, P_j) \equiv i \neq j \,\land\, Connected(P_i, P_j) \,\land\, \neg Saturated(P_i \cup P_j) \,\land\, Underutilized(P_i) \,\land\, Underutilized(P_j).$$

Note that at each greedy step, an effort is made to remove underutilized partitions. However, this scheme does not guarantee that the final partitioning is completely free of underutilized partitions.

The merge benefit is computed as the amount of inter-PE communication saved by merging two partitions, so that each greedy step reduces the objective function to be minimized as much as possible. Formally:

$$MergeBenefit(P_i, P_j) = Rate(P_i) + Rate(P_j) - Rate(P_i \cup P_j).$$

Since the merged partitions must be connected by at least one link, each greedy step reduces the aggregate inter-PE communication, unless the rate of communication between the merged partitions is equal to zero. Algorithm 1 gives a summary of the GreedyFuse algorithm.
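A compact C++ sketch of GreedyFuse under the definitions above. The statistics plumbing (operLoad, rate, connectivity) is left as declarations backed by the profiling data, the MaxFrac/MinUtil defaults mirror those used in Section 5, and everything else follows Algorithm 1 directly.

```cpp
#include <cstddef>
#include <set>
#include <vector>

using Partition = std::set<std::size_t>;  // a set of operator indices

// Assumed to be backed by the profiling statistics (Section 3):
double operLoad(const Partition& p);                // sum of u(O_j)
double rate(const Partition& p);                    // inter-PE rate
double avgTupleSize(const Partition& p);
double v(double rate, double size);                 // mapping from Section 3.4
bool connected(const Partition& a, const Partition& b);

const double MaxFrac = 0.5, MinUtil = 0.95;

double compLoad(const Partition& p) { return operLoad(p) + v(rate(p), avgTupleSize(p)); }
bool saturated(const Partition& p) { return compLoad(p) > MaxFrac; }
bool underutilized(const Partition& p) { return operLoad(p) / compLoad(p) < MinUtil; }

Partition unite(const Partition& a, const Partition& b) {
    Partition u = a; u.insert(b.begin(), b.end()); return u;
}

bool mergable(const Partition& a, const Partition& b) {
    return connected(a, b) && !saturated(unite(a, b)) &&
           underutilized(a) && underutilized(b);
}

// Inter-PE communication saved by a merge; each greedy step maximizes it.
double mergeBenefit(const Partition& a, const Partition& b) {
    return rate(a) + rate(b) - rate(unite(a, b));
}

std::vector<Partition> greedyFuse(std::size_t numOperators) {
    std::vector<Partition> parts;  // start with one partition per operator
    for (std::size_t i = 0; i < numOperators; ++i) parts.push_back({i});
    for (;;) {
        std::size_t bi = 0, bj = 0;
        double best = -1.0;
        for (std::size_t i = 0; i < parts.size(); ++i)      // candidate merges
            for (std::size_t j = i + 1; j < parts.size(); ++j)
                if (mergable(parts[i], parts[j])) {
                    double benefit = mergeBenefit(parts[i], parts[j]);
                    if (benefit > best) { best = benefit; bi = i; bj = j; }
                }
        if (best < 0.0) break;                 // no mergable pairs remain
        parts[bi] = unite(parts[bi], parts[bj]);
        parts.erase(parts.begin() + static_cast<std::ptrdiff_t>(bj));
    }
    return parts;
}
```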

SPADE's fusion optimizer also performs the placement of PEs onto compute nodes. The details of the placement scheme are beyond the scope of this paper. In summary, it uses a form of clustering (PEs into nodes) with the goal of minimizing inter-node communication.

5. EXPERIMENTAL EVALUATION

In this section, we evaluate the performance of SPADE's code-generation-based fusion and profiling mechanisms, as well as the effectiveness of its fusion optimizer, using both synthetic and real-world workloads and applications.

5.1 Experimental Setup

The experiments presented in this section were performed on a small subset of our System S cluster at IBM T. J. Watson, using up to 4 nodes, where each node has a quad-core 3GHz Intel Xeon processor and 4GB of main memory. The nodes are connected together with a Gigabit Ethernet network. The results reported in the paper are from the steady-state runtime behavior of the applications and are deduced from raw data collected via reservoir sampling. The default sampling rate used for profiling is s = 0.001, and the default reservoir size used is S = 5000. The default MaxFrac and MinUtil values used for the fusion optimizer (see Section 4) are 0.5 and 0.95, respectively.

The experimental study is divided into two parts. In the first part, we use synthetic workloads and micro-benchmarks to study the trade-offs involved in fusing operators into PEs, as well as the impact of profiling on the application performance. In both cases, we use the steady-state throughput of the application as our evaluation metric. In other words, the aggregate rate at which the source operators can pump data into the rest of the processing chain is used as our performance metric.

In the second part, we apply the fusion optimizer to two real-world applications, using real-world workloads. The first application is from the radio-astronomy domain and the second one is from the financial markets domain. For both applications, we perform the profile/optimize/re-fuse step and compare the performance of the resulting optimized application with other alternatives, including the hand-fused versions that are optimized manually by the application developers themselves.

5.2 Micro Benchmarks

Benefit of fusion: To showcase the benefit of fusion, we performed a series of micro-benchmarks. We used a simple application, which is a series of operators that form a processing chain. The source operator feeds the rest of the processing chain with tuples that it creates on-the-fly. Each operator in the processing chain simply forwards the tuples it receives to the next operator in line, until the tuples finally reach the sink. There is no operator logic involved in the base version of this simple application. We compile this application with different numbers of operators, as well as with different levels of fusion, and examine the resulting throughput in tuples/sec. All the experiments in this section are performed on a single quad-core node.

The graphs in Figure 6 plot throughput in tuples/second as a function of the number of operators in the processing chain, for different numbers of operators per PE. The legend can be read from Figure 8. For instance, if the number of operators in the chain is 16 and the number of operators per PE is 2 (the red line with circle-shaped markers), then we have 16/2 = 8 PEs executing the 16 operators, where consecutive pairs of operators along the chain are fused into their own PEs. We observe from the figure that, for all chain lengths from 1 to 32, the fully fused scenario (number of PEs equal to 1) always prevails over the other alternatives. Even though the node has 4 processors, we are unable to take advantage of pipeline parallelism, since the cost of inter-process communication dominates. This is mostly because the effective utilization (as defined in Section 4) is very low for the operators as well as for the fused PEs in the chain, since there is no operator logic involved.

[Figure 6: Overhead of no-fusion, with varying number of operators and empty operator logic (throughput in tuples/sec vs. number of operators).]

[Figure 7: Overhead of no-fusion, with varying tuple sizes and empty operator logic (throughput in tuples/sec vs. tuple size in bytes).]

[Figure 8: Fusion trade-off, with varying costs of operator logic (throughput in tuples/sec vs. computation size; legend: number of OPs per PE = 1, 2, 4, 8, 16, 32).]

[Figure 9: Profiling overhead, all operators fused into one PE (throughput in tuples/sec vs. number of OPs; legend: profiling sampling rate = 0, 0.001, 0.01, 0.1, 1).]

The graphs in Figure 7 plot throughput in tuples/second as a function of the tuple size, for different numbers of operators per PE. The number of operators in the chain is fixed at 16. Again, fusion prevails over the other alternatives. It is more interesting to look at the case where we have a sufficient amount of per-tuple processing, so that the per-PE effective utilization starts to improve in partially fused scenarios, making it possible to exploit the pipelined parallelism available from the 4 processors. Along these lines, the graphs in Figure 8 plot throughput in tuples/second as a function of the computation size, for different numbers of operators per PE. Computation size is a variable we use to increase the per-tuple processing cost. In this case, we use a dummy loop to perform a busy wait. The larger the computation size, the longer the busy wait. As we observe from Figure 8, for small computation sizes the fuse-all scenario still outperforms the alternatives, but as the amount of per-tuple processing increases, the partially fused scenarios start outperforming the fuse-all scenario. This clearly illustrates the trade-off between making use of pipeline parallelism and reducing the inter-process communication overhead. Recall that the fusion optimizer is designed to stop fusing once the effective utilization for a PE reaches a predefined threshold, thereby avoiding the overhead of inter-process communication while still taking advantage of the available parallelism. In the next subsection we look at how well the fusion optimizer performs in real-world applications.

Cost of profiling: One of the major goals of our profiling framework is to be able to collect statistics with as little overhead as possible, so as not to alter the runtime behavior of the application due to instrumentation.

The graphs in Figure 9 plot throughput in tuples/second as a function of the number of operators in the chain, for different profiling sampling rates. Note that the reservoir size does not have an impact on the performance, but only on the memory consumption. As a result, we only show the impact of the profiling sampling rate on the performance. In this experiment all operators are fused into one single PE. Recall that the profiling framework is fusion transparent and can work under any fusion scenario. Figure 9 shows that s = 0.001 and no profiling perform almost the same, and s = 0.01 is very close, with only around 10% overhead. When we move to s = 0.1, the overhead increases to around 50%. In our experiments in the rest of the paper, we use a default value of s = 0.001, which incurs almost no overhead. This is quite conservative, since the results show that s ≤ 0.01 is overall a good setting.

5.3 Real-world Data and Applications

In this section we look at two real-world applications, one from the radio-astronomy domain and another from the financial markets domain. For both applications, we compare the throughput using the following four fusion strategies:

NONE − No fusion, one PE per operator

FALL − Fuse-all, one PE hosts all operators

FMAN − Manual fusion performed by the app. developer

FATM − Automatic fusion performed by SPADE

It is worth noting that FMAN represents the scenario where the application developer (in our experience, a domain area expert, but not necessarily a systems optimization expert) uses the knobs exposed by the SPADE language to perform the fusion herself, including the placement of PEs to nodes.

Outlier Detection/Radio-astronomy: This is an application that detects outliers, such as gamma ray bursts, in radio data from outer space. The data flow graph for the application is shown in Figure 12. The application reads its data from a workload file stored on disk. The workload data is collected from the LOIS [17] project, which is the Scandinavian extension to LOFAR [16] − a radio-telescope currently under construction in northwestern Europe, which consists of a large number of small radio antennas distributed over a large geographical area, instead of one giant radio-telescope. This application was built using the SPADE language and the SPADE built-in operators.

[Figure 10: Cosmic ray shower detection performance. Throughput relative to FATM: NONE 58.34%, FMAN 50.13%, FALL 54.16%, FATM 100%.]

[Figure 11: Bargain detection performance. Throughput relative to FATM: NONE 17.14%, FMAN 57.14%, FALL 50.82%, FATM 100%.]

[Figure 12: Operator-level data flow graph (Outlier Detection/Radio-astronomy).]

[Figure 13: PE-level data flow graph, hand-fused.]

[Figure 14: PE-level data flow graph, auto-fused.]

[Figure 15: Operator-level data flow graph (Bargain Detection/Financial Trading).]

[Figure 16: PE-level data flow graph and placement, hand-fused and hand-placed.]

[Figure 17: PE-level data flow graph, auto-fused and hand-placed.]

Figure 10 shows the throughput achieved by the different fusion strategies, relative to the FATM scenario. The NONE scenario uses all 4 nodes, the FMAN scenario uses 2 nodes, and, surprisingly, the FATM scenario uses a single node. FALL always uses a single node. Recall that each node has 4 cores. The results show that the automatic fusion (shown in Figure 14) performs 70%, 100%, and 85% better compared to NONE, FMAN, and FALL, respectively. Most interestingly, manual fusion (shown in Figure 13) performs the worst, which attests to the importance of automatic fusion optimization and the inherent difficulty of manually optimizing applications. For large-scale applications it is truly impractical to expect application developers to come up with an effective fusion and placement strategy.

Bargain Detection/Financial Trading: This is an application that detects bargains, using financial ticker streams containing trade and quote data from a stock exchange (see [3] for more details). The data flow graph for the application, which contains 200 operators, is shown in Figure 15. The application reads its data from a workload file stored on disk. The workload data contains 22 days worth of ticker data from the NY Stock Exchange, for the month of December 2005. Again, the application is built using the SPADE language and the SPADE built-in operators.

Figure 11 shows the throughput achieved by the different fusion strategies, relative to the FATM scenario. The NONE and FMAN scenarios use all 4 nodes, whereas the FATM scenario uses 2 nodes. Figures 16 and 17 show the manual and automatic fusion, respectively. These figures also show how PEs are mapped to nodes: rectangular blocks with different colors represent different nodes. The results show that automatic fusion performs 6, 2, and 1.75 times better than NONE, FALL, and FMAN, respectively. Although we have not studied the optimality of the fusion optimizer, these results are very promising and showcase the strength of our optimization framework.

6. RELATED WORK

In the relational data processing world, frameworks such as STREAM [4], Borealis [1], and TelegraphCQ [8], among others, focus on providing a declarative language or a graphical flow-graph composition scheme for constructing data stream processing applications, as well as a runtime to deploy and execute these applications. A fundamental difference between SPADE/System S and these systems is that SPADE relies on a code-generation framework, instead of type-generic operator implementations that employ some form of condition interpretation and type introspection. The reliance on code generation provides the means for creating highly optimized platform- and application-specific code. In contrast to traditional database query compilers, the SPADE compiler outputs code that is tailored both to the application and to system-specific aspects such as the underlying network topology, the distributed processing topology for the application (i.e., where each piece will run), and the computational environment, including hardware- and architecture-specific tuning. In most cases, applications created with SPADE are long-running queries; the long runtimes amortize the build costs. None of the existing approaches gives the developer the language constructs or the compiler optimization knobs to write an application in a granular way that truly leverages the levels of parallelism available in distributed settings on modern processor hardware. Unlike traditional database optimization, SPADE does not perform re-orderings on the query graph, since in the general case no assumptions can be made about the semantics of the operators in the data flow graph.
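To illustrate the distinction between interpretation and code generation, the following is a hypothetical Python analogy, not actual SPADE compiler output: a type-generic filter that interprets a predicate description on every tuple, versus the specialized loop a generator could emit once the schema and predicate are fixed at compile time.

```python
# Hypothetical analogy: generic interpreted filtering vs. the
# specialized code a generator could emit for a known schema/predicate.
import operator

OPS = {"<": operator.lt, ">": operator.gt}

def generic_filter(tuples, schema, conds):
    """conds: list of (field_name, op_symbol, constant).
    Field offsets and operators are resolved per tuple."""
    for t in tuples:
        if all(OPS[op](t[schema.index(f)], c) for f, op, c in conds):
            yield t

def generated_filter(tuples):
    """What a generator might emit for `price < 100.0 and volume > 500`
    on schema (symbol, ts, price, volume): fixed offsets, no lookups."""
    for t in tuples:
        if t[2] < 100.0 and t[3] > 500:
            yield t
```

The generated form removes the per-tuple name lookups and operator dispatch; the SPADE compiler performs the analogous specialization at a much lower level, with the added benefit of tuning to the hardware and network topology.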

On the programming language side, StreamIt [21] is certainly the closest to our work, but its focus is on implementing stream flows for DSP-based applications. XStream [11] is another similar streaming system for DSP-like applications. More recently, the Aspen language [23] shares some commonalities with our work on SPADE, the most important being the philosophy of providing a high-level programming language that shields users from the complexities of a distributed environment. Many distinctions exist, however. SPADE is organized in terms of high-level operators forming toolkits (e.g., a relational algebra toolkit) that can be extended with additional operators. Most importantly, the SPADE compiler generates code and, hence, can customize the runtime artifacts to the characteristics of the runtime environment, including architecture-specific and topological optimizations.

Finally, contrasting SPADE with Hadoop [12], as one representative middleware supporting the map/reduce paradigm, or with Maryland's Active Data Repository [15] and DataCutter [7], the key difference is the abstraction level employed for writing applications. These approaches rely on low-level programming constructs: analytics are written from scratch, as opposed to being composed from built-in, granular operators. Moreover, map/reduce operations can only be used for computations that are associative and commutative by nature; the sketch following this paragraph illustrates the distinction. Pig Latin [19] was recently introduced to improve upon the low-level programming abstractions of map/reduce systems. Pig Latin sits in a sweet spot between the declarative style of SQL and the low-level, procedural style of map/reduce. However, we are not aware of any optimization frameworks tied to Pig Latin. Furthermore, Pig Latin is not designed as a streaming system, but rather as a front-end to map/reduce engines.
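As a small, hypothetical Python illustration of the associativity/commutativity point: a sum can be computed on data partitions and merged in any order, whereas an order-sensitive streaming computation has no such decomposition.

```python
# A sum is associative and commutative: per-partition partial results
# can be merged in any order, which is what map/reduce exploits.
from functools import reduce

parts = [[1, 5, 2], [7, 3], [4, 6]]                       # data split across workers
partial = [reduce(lambda a, b: a + b, p) for p in parts]  # per-worker sums
total = reduce(lambda a, b: a + b, partial)               # merge in any order
assert total == 28

# An order-sensitive streaming computation cannot be split this way:
# each output depends on the arrival order of consecutive tuples.
def rising_edges(stream):
    prev = None
    for x in stream:
        if prev is not None and x > prev:
            yield x
        prev = x
```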

7. CONCLUSION

In this paper, we have addressed a major challenge in optimizing distributed stream processing applications: finding an effective and flexible way of mapping the logical graph of operators into a physical one that can be deployed on a set of distributed nodes. Using the stream-centric, operator-based SPADE language and its code-generating compiler, we have developed an effective solution that relies on a two-stage optimization framework. First, an instrumented version of the application is generated in order to profile and learn the computation and communication characteristics of the application. Next, this profiling information is fed to a fusion optimizer that produces a physical data flow graph, which is deployable on the System S distributed runtime and is optimized to strike a good balance between exploiting parallelism and avoiding costly inter-process communication. We have used real-world applications and workloads to showcase the performance benefits that can be achieved with SPADE's automatic application optimization framework.

8. REFERENCES

[1] D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. S. Maskey, A. Rasin, E. Ryvkina, N. Tatbul, Y. Xing, and S. Zdonik. The design of the Borealis stream processing engine. In CIDR, 2005.
[2] L. Amini, H. Andrade, R. Bhagwan, F. Eskesen, R. King, P. Selo, Y. Park, and C. Venkatramani. SPC: A distributed, scalable platform for data mining. In Workshop on Data Mining Standards, Services and Platforms, DM-SSP, 2006.
[3] H. Andrade, B. Gedik, K.-L. Wu, and P. S. Yu. Scale-up strategies for processing high-rate data streams in System S. In IEEE ICDE, 2009.
[4] A. Arasu, B. Babcock, S. Babu, M. Datar, K. Ito, R. Motwani, I. Nishizawa, U. Srivastava, D. Thomas, R. Varma, and J. Widom. STREAM: The Stanford stream data manager. IEEE Data Engineering Bulletin, 26, 2003.
[5] A. Arasu, S. Babu, and J. Widom. The CQL continuous query language: Semantic foundations and query execution. Technical report, InfoLab – Stanford University, October 2003.
[6] H. Balakrishnan, M. Balazinska, D. Carney, U. Cetintemel, M. Cherniack, C. Convey, E. Galvez, J. Salz, M. Stonebraker, N. Tatbul, R. Tibbetts, and S. Zdonik. Retrospective on Aurora. VLDB Journal, Special Issue on Data Stream Processing, 2004.
[7] M. Beynon, R. Ferreira, T. Kurc, A. Sussman, and J. Saltz. DataCutter: Middleware for filtering very large scientific datasets on archival storage systems. In IEEE Symposium on Mass Storage Systems, MSST, 2000.
[8] S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, V. Raman, F. Reiss, and M. A. Shah. TelegraphCQ: Continuous dataflow processing for an uncertain world. In CIDR, 2003.
[9] Coral8, Inc. http://www.coral8.com, May 2007.
[10] B. Gedik, H. Andrade, K.-L. Wu, and P. S. Yu. SPADE: The System S declarative stream processing engine. In ACM SIGMOD, 2008.
[11] L. Girod, Y. Mei, R. Newton, S. Rost, A. Thiagarajan, H. Balakrishnan, and S. Madden. XStream: A signal-oriented data stream management system. In IEEE ICDE, 2008.
[12] Hadoop. http://hadoop.apache.org.
[13] G. Jacques-Silva, J. Challenger, L. Degenaro, J. Giles, and R. Wagle. Towards autonomic fault recovery in System S. In International Conference on Autonomic Computing, ICAC, 2007.
[14] N. Jain, L. Amini, H. Andrade, R. King, Y. Park, P. Selo, and C. Venkatramani. Design, implementation, and evaluation of the linear road benchmark on the stream processing core. In ACM SIGMOD, 2006.
[15] T. Kurc, C. Chang, R. Ferreira, A. Sussman, and J. Saltz. Querying very large multi-dimensional datasets in ADR. In ACM/IEEE Supercomputing Conference, SC, 1999.
[16] LOFAR. http://www.lofar.org/, June 2008.
[17] LOIS. http://www.lois-space.net/, June 2008.
[18] MATLAB. http://www.mathworks.com, October 2007.
[19] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In ACM SIGMOD, 2008.
[20] StreamBase Systems. http://www.streambase.com, May 2008.
[21] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In International Conference on Compiler Construction, CC, 2002.
[22] J. D. Ullman. Database and Knowledge-Base Systems. Computer Science Press, 1988.
[23] G. Upadhyaya, V. S. Pai, and S. P. Midkiff. Expressing and exploiting concurrency in networked applications with Aspen. In ACM Symposium on Principles and Practice of Parallel Programming, PPoPP, 2007.
[24] J. S. Vitter. Random sampling with a reservoir. ACM Transactions on Mathematical Software, 11(1):37–57, 1985.
[25] J. Wolf, N. Bansal, K. Hildrum, S. Parekh, D. Rajan, R. Wagle, K.-L. Wu, and L. Fleischer. SODA: An optimizing scheduler for large-scale stream-based distributed computer systems. In Middleware, 2008.
