
Models and Issues in Data Stream Systems*

Brian Babcock Shivnath Babu Mayur Datar Rajeev Motwani Jennifer Widom

Department of Computer Science

Stanford University

Stanford, CA 94305
{babcock, shivnath, datar, rajeev, widom}@cs.stanford.edu

Abstract

In this overview paper we motivate the need for and research issues arising from a new model of data processing. In this model, data does not take the form of persistent relations, but rather arrives in multiple, continuous, rapid, time-varying data streams. In addition to reviewing past work relevant to data stream systems and current projects in the area, the paper explores topics in stream query languages, new requirements and challenges in query processing, and algorithmic issues.

1 Introduction

Recently a new class of data-intensive applications has become widely recognized: applications in which the data is modeled best not as persistent relations but rather as transient data streams. Examples of such applications include financial applications, network monitoring, security, telecommunications data management, web applications, manufacturing, sensor networks, and others. In the data stream model, individual data items may be relational tuples, e.g., network measurements, call records, web page visits, sensor readings, and so on. However, their continuous arrival in multiple, rapid, time-varying, possibly unpredictable and unbounded streams appears to yield some fundamentally new research problems.

In all of the applications cited above, it is not feasible to simply load the arriving data into a traditional database management system (DBMS) and operate on it there. Traditional DBMS's are not designed for rapid and continuous loading of individual data items, and they do not directly support the continuous queries [84] that are typical of data stream applications. Furthermore, it is recognized that both approximation [13] and adaptivity [8] are key ingredients in executing queries and performing other processing (e.g., data analysis and mining) over rapid data streams, while traditional DBMS's focus largely on the opposite goal of precise answers computed by stable query plans.

In this paper we consider fundamental models and issues in developing a general-purpose Data Stream Management System (DSMS). We are developing such a system at Stanford [82], and we will touch on some of our own work in this paper. However, we also attempt to provide a general overview of the area, along with its related and current work. (Any glaring omissions are, naturally, our own fault.)

We begin in Section 2 by considering the data stream model and queries over streams. In this section we take a simple view: streams are append-only relations with transient tuples, and queries are SQL operating over these logical relations. In later sections we discuss several issues that complicate the model and query language, such as ordering, timestamping, and sliding windows. Section 2 also presents some concrete examples to ground our discussion.

In Section 3 we review recent projects geared specifically towards data stream processing, as well as a plethora of past research in areas related to data streams: active databases, continuous queries, filtering systems, view management, sequence databases, and others. Although much of this work clearly has applications to data stream processing, we hope to show in this paper that there are many new problems to address in realizing a complete DSMS.

* Work supported by NSF Grant IIS-0118173. Mayur Datar was also supported by a Microsoft Graduate Fellowship. Rajeev Motwani received partial support from an Okawa Foundation Research Grant.

Section 4 delves more deeply into the area of query processing, uncovering a number of important issues, including:

• Queries that require an unbounded amount of memory to evaluate precisely, and approximate query processing techniques to address this problem.

• Sliding window query processing (i.e., considering "recent" portions of the streams only), both as an approximation technique and as an option in the query language, since many applications prefer sliding-window queries.

• Batch processing, sampling, and synopsis structures to handle situations where the flow rate of the input streams may overwhelm the query processor.

• The meaning and implementation of blocking operators (e.g., aggregation and sorting) in the presence of unending streams.

• Continuous queries that are registered when portions of the data streams have already "passed by," yet the queries wish to reference stream history.

Section 5 then outlines some details of a query language and an architecture for a DSMS query processor designed specifically to address the issues above.

In Section 6 we review algorithmic results in data stream processing. Our focus is primarily on sketching techniques and building summary structures (synopses). We also touch upon sliding window computations, present some negative results, and discuss a few additional algorithmic issues.

We conclude in Section 7 with some remarks on the evolution of this new field, and a summary of directions for further work.

2 The Data Stream Model

In the data stream model, some or all of the input data that are to be operated on are not available for random access from disk or memory, but rather arrive as one or more continuous data streams. Data streams differ from the conventional stored relation model in several ways:

• The data elements in the stream arrive online.

• The system has no control over the order in which data elements arrive to be processed, either within a data stream or across data streams.

• Data streams are potentially unbounded in size.

• Once an element from a data stream has been processed it is discarded or archived — it cannot be retrieved easily unless it is explicitly stored in memory, which typically is small relative to the size of the data streams.

Operating in the data stream model does not preclude the presence of some data in conventional stored relations. Often, data stream queries may perform joins between data streams and stored relational data. For the purposes of this paper, we will assume that if stored relations are used, their contents remain static. Thus, we preclude any potential transaction-processing issues that might arise from the presence of updates to stored relations that occur concurrently with data stream processing.


2.1 Queries

Queries over continuous data streams have much in common with queries in a traditional database management system. However, there are two important distinctions peculiar to the data stream model. The first distinction is between one-time queries and continuous queries [84]. One-time queries (a class that includes traditional DBMS queries) are queries that are evaluated once over a point-in-time snapshot of the data set, with the answer returned to the user. Continuous queries, on the other hand, are evaluated continuously as data streams continue to arrive. Continuous queries are the more interesting class of data stream queries, and it is to them that we will devote most of our attention. The answer to a continuous query is produced over time, always reflecting the stream data seen so far. Continuous query answers may be stored and updated as new data arrives, or they may be produced as data streams themselves. Sometimes one or the other mode is preferred. For example, aggregation queries may involve frequent changes to answer tuples, dictating the stored approach, while join queries are monotonic and may produce rapid, unbounded answers, dictating the stream approach.

The second distinction is between predefined queries and ad hoc queries. A predefined query is one that is supplied to the data stream management system before any relevant data has arrived. Predefined queries are generally continuous queries, although scheduled one-time queries can also be predefined. Ad hoc queries, on the other hand, are issued online after the data streams have already begun. Ad hoc queries can be either one-time queries or continuous queries. Ad hoc queries complicate the design of a data stream management system, both because they are not known in advance for the purposes of query optimization, identification of common subexpressions across queries, etc., and more importantly because the correct answer to an ad hoc query may require referencing data elements that have already arrived on the data streams (and potentially have already been discarded). Ad hoc queries are discussed in more detail in Section 4.6.

2.2 Motivating Examples

Examples motivating a data stream system can be found in many application domains, including finance, web applications, security, networking, and sensor monitoring.

• Traderbot [85] is a web-based financial search engine that evaluates queries over real-time streaming financial data such as stock tickers and news feeds. The Traderbot web site [85] gives some examples of one-time and continuous queries that are commonly posed by its customers.

• Modern security applications often apply sophisticated rules over network packet streams. For example, iPolicy Networks [52] provides an integrated security platform providing services such as firewall support and intrusion detection over multi-gigabit network packet streams. Such a platform needs to perform complex stream processing including URL-filtering based on table lookups, and correlation across multiple network traffic flows.

• Large web sites monitor web logs (clickstreams) online to enable applications such as personalization, performance monitoring, and load-balancing. Some web sites served by widely distributed web servers (e.g., Yahoo [95]) may need to coordinate many distributed clickstream analyses, e.g., to track heavily accessed web pages as part of their real-time performance monitoring.

• There are several emerging applications in the area of sensor monitoring [16, 58] where a large number of sensors are distributed in the physical world and generate streams of data that need to be combined, monitored, and analyzed.


The application domain that we use for more detailed examples is network traffic management, which involves monitoring network packet header information across a set of routers to obtain information on traffic flow patterns. Based on a description of Babu and Widom [10], we delve into this example in some detail to help illustrate that continuous queries arise naturally in real applications and that conventional DBMS technology does not adequately support such queries.

Consider the network traffic management system of a large network, e.g., the backbone network of an Internet Service Provider (ISP) [30]. Such systems monitor a variety of continuous data streams that may be characterized as unpredictable and arriving at a high rate, including both packet traces and network performance measurements. Typically, current traffic-management tools either rely on a special-purpose system that performs online processing of simple hand-coded continuous queries, or they just log the traffic data and perform periodic offline query processing. Conventional DBMS's are deemed inadequate to provide the kind of online continuous query processing that would be most beneficial in this domain. A data stream system that could provide effective online processing of continuous queries over data streams would allow network operators to install, modify, or remove appropriate monitoring queries to support efficient management of the ISP's network resources.

Consider the following concrete setting. Network packet traces are being collected from a number of links in the network. The focus is on two specific links: a customer link, C, which connects the network of a customer to the ISP's network, and a backbone link, B, which connects two routers within the backbone network of the ISP. Let C and B denote two streams of packet traces corresponding to these two links. We assume, for simplicity, that the traces contain just the five fields of the packet header that are listed below.

src: IP address of packet sender.

dest: IP address of packet destination.

id: Identification number given by sender so that destination can uniquely identify each packet.

len: Length of the packet.

time: Time when packet header was recorded.

Consider first the continuous query Q1, which computes load on the link B averaged over one-minute intervals, notifying the network operator when the load crosses a specified threshold t. The functions getminute and notifyoperator have the natural interpretation.

Q1: SELECT notifyoperator(sum(len))
    FROM B
    GROUP BY getminute(time)
    HAVING sum(len) > t

While the functionality of such a query may possibly be achieved in a DBMS via the use of triggers, we are likely to prefer the use of special techniques for performance reasons. For example, consider the case where the link B has a very high throughput (e.g., if it were an optical link). In that case, we may choose to compute an approximate answer to Q1 by employing random sampling on the stream — a task outside the reach of standard trigger mechanisms.
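To make the sampling idea concrete, here is a minimal sketch (ours, not the paper's) of estimating Q1's per-minute load from a Bernoulli sample of the packet stream. The sampling probability, the threshold value, and the notify_operator callback are invented for illustration.

    import random

    SAMPLE_PROB = 0.01     # assumed sampling rate
    THRESHOLD_T = 10**9    # assumed threshold t (bytes per minute)
    minute_load = {}       # getminute(time) -> estimated sum(len)

    def notify_operator(minute, est_load):
        print("load alarm for minute", minute, ":", est_load)

    def process_packet(pkt_len, pkt_time, getminute):
        # Keep each packet with probability SAMPLE_PROB and scale its length
        # by 1/SAMPLE_PROB, so the per-minute total is an unbiased estimate.
        if random.random() < SAMPLE_PROB:
            minute = getminute(pkt_time)
            minute_load[minute] = minute_load.get(minute, 0.0) + pkt_len / SAMPLE_PROB
            if minute_load[minute] > THRESHOLD_T:
                notify_operator(minute, minute_load[minute])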

The second query, Q2, isolates flows in the backbone link and determines the amount of traffic generated by each flow. A flow is defined here as a sequence of packets grouped in time, and sent from a specific source to a specific destination.


Q2: SELECT flowid, src, dest, sum(len) AS flowlen
    FROM (SELECT src, dest, len, time
          FROM B
          ORDER BY time)
    GROUP BY src, dest, getflowid(src, dest, time) AS flowid

Here getflowid is a user-defined function which takes the source IP address, the destination IP address, and the timestamp of a packet, and returns the identifier of the flow to which the packet belongs. We assume that the data in the view (or table expression) in the FROM clause is passed to the getflowid function in the order defined by the ORDER BY clause.

Observe that handling Q2 over stream B is particularly challenging due to the presence of GROUP BY and ORDER BY clauses, which lead to "blocking" operators in a query execution plan.

Consider now the task of determining the fraction of the backbone link's traffic that can be attributed to the customer network. This query, Q3, is an example of the kind of ad hoc continuous queries that may be registered during periods of congestion to determine whether the customer network is the likely cause.

Q3: (SELECT count(*)
     FROM C, B
     WHERE C.src = B.src and C.dest = B.dest and C.id = B.id)
    /
    (SELECT count(*) FROM B)

Observe that Q3 joins streams C and B on their keys to obtain a count of the number of common packets. Since joining two streams could potentially require unbounded intermediate storage (for example, if there is no bound on the delay between a packet showing up on the two links), the user may prefer to compute an approximate answer. One approximation technique would be to maintain bounded-memory synopses of the two streams (see Section 6); alternatively, one could exploit aspects of the application semantics to bound the required storage (e.g., we may know that joining tuples are very likely to occur within a bounded time window).
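As an illustration of the second technique, the following sketch (ours, not the paper's) evaluates Q3 in bounded memory under the assumption that matching packets appear on the two links within a fixed time window; the 60-second bound and all names are invented.

    from collections import deque

    WINDOW = 60.0  # assumed bound (seconds) on the delay between a packet
                   # appearing on links C and B

    class WindowedCountJoin:
        # Symmetric join of streams C and B on (src, dest, id), keeping only
        # tuples whose timestamps fall within WINDOW so storage stays bounded.
        def __init__(self):
            self.recent = {"C": deque(), "B": deque()}  # (key, time), arrival order
            self.keys = {"C": {}, "B": {}}              # key -> count in window
            self.joined = 0                             # numerator of Q3
            self.total_b = 0                            # denominator of Q3

        def _expire(self, side, now):
            q, k = self.recent[side], self.keys[side]
            while q and q[0][1] < now - WINDOW:
                key, _ = q.popleft()
                k[key] -= 1
                if k[key] == 0:
                    del k[key]

        def arrive(self, side, key, now):
            other = "B" if side == "C" else "C"
            self._expire(side, now)
            self._expire(other, now)
            if side == "B":
                self.total_b += 1
            self.joined += self.keys[other].get(key, 0)  # match within window
            self.recent[side].append((key, now))
            self.keys[side][key] = self.keys[side].get(key, 0) + 1

        def answer(self):
            return self.joined / self.total_b if self.total_b else 0.0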

Our final example, Q4, is a continuous query for monitoring the source-destination pairs in the top 5 percent in terms of backbone traffic. For ease of exposition, we employ the WITH construct from SQL-99 [87].

Q4: WITH Load AS
      (SELECT src, dest, sum(len) AS traffic
       FROM B
       GROUP BY src, dest)
    SELECT src, dest, traffic
    FROM Load AS L1
    WHERE (SELECT count(*)
           FROM Load AS L2
           WHERE L2.traffic > L1.traffic)
        < (SELECT 0.05 * count(*) FROM Load)
    ORDER BY traffic


3 Review of Data Stream Projects

We now provide an overview of several past and current projects related to data stream management. We will revisit some of these projects in later sections when we discuss the issues that we are facing in building a general-purpose data stream management system at Stanford.

Continuous queries were used in the Tapestry system [84] for content-based filtering over an append-only database of email and bulletin board messages. A restricted subset of SQL was used as the query language in order to provide guarantees about efficient evaluation and append-only query results. The Alert system [74] provides a mechanism for implementing event-condition-action style triggers in a conventional SQL database, by using continuous queries defined over special append-only active tables. The XFilter content-based filtering system [6] performs efficient filtering of XML documents based on user profiles expressed as continuous queries in the XPath language [94]. Xyleme [67] is a similar content-based filtering system that enables very high throughput with a restricted query language. The Tribeca stream database manager [83] provides restricted querying capability over network packet streams. The Tangram stream query processing system [68, 69] uses stream processing techniques to analyze large quantities of stored data.

The OpenCQ [57] and NiagaraCQ [24] systems support continuous queries for monitoring persistent data sets spread over a wide-area network, e.g., web sites over the Internet. OpenCQ uses a query processing algorithm based on incremental view maintenance, while NiagaraCQ addresses scalability in number of queries by proposing techniques for grouping continuous queries for efficient evaluation. Within the NiagaraCQ project, Shanmugasundaram et al. [79] discuss the problem of supporting blocking operators in query plans over data streams, and Viglas and Naughton [89] propose rate-based optimization for queries over data streams, a new optimization methodology that is based on stream-arrival and data-processing rates.

The Chronicle data model [55] introduced append-only ordered sequences of tuples (chronicles), a form of data streams. They defined a restricted view definition language and algebra (chronicle algebra) that operates over chronicles together with traditional relations. The focus of the work was to ensure that views defined in chronicle algebra could be maintained incrementally without storing any of the chronicles. An algebra and a declarative query language for querying ordered relations (sequences) was proposed by Seshadri, Livny, and Ramakrishnan [76, 77, 78]. In many applications, continuous queries need to refer to the sequencing aspect of streams, particularly in the form of sliding windows over streams. Related work in this category also includes work on temporal [80] and time-series databases [31], where the ordering of tuples implied by time can be used in querying, indexing, and query optimization.

The body of work on materialized views relates to continuous queries, since materialized views are effectively queries that need to be reevaluated or incrementally updated whenever the base data changes. Of particular importance is work on self-maintenance [15, 45, 71] — ensuring that enough data has been saved to maintain a view even when the base data is unavailable — and the related problem of data expiration [36] — determining when certain base data can be discarded without compromising the ability to maintain a view. Nevertheless, several differences exist between materialized views and continuous queries in the data stream context: continuous queries may stream rather than store their results, they may deal with append-only input data, they may provide approximate rather than exact answers, and their processing strategy may adapt as characteristics of the data streams change.

The Telegraph project [8, 47, 58, 59] shares some target applications and basic technical ideas with a DSMS. Telegraph uses an adaptive query engine (based on the Eddy concept [8]) to process queries efficiently in volatile and unpredictable environments (e.g., autonomous data sources over the Internet, or sensor networks). Madden and Franklin [58] focus on query execution strategies over data streams generated by sensors, and Madden et al. [59] discuss adaptive processing techniques for multiple continuous queries. The Tukwila system [53] also supports adaptive query processing, in order to perform dynamic data integration over autonomous data sources.


The Aurora project [16] is building a new data processing system targeted exclusively towards stream monitoring applications. The core of the Aurora system consists of a large network of triggers. Each trigger is a data-flow graph with each node being one among seven built-in operators (or boxes in Aurora's terminology). For each stream monitoring application using the Aurora system, an application administrator creates and adds one or more triggers into Aurora's trigger network. Aurora performs both compile-time optimization (e.g., reordering operators, shared state for common subexpressions) and run-time optimization of the trigger network. As part of run-time optimization, Aurora detects resource overload and performs load shedding based on application-specific measures of quality of service.

4 Queries over Data Streams

Query processing in the data stream model of computation comes with its own unique challenges. In this section, we will outline what we consider to be the most interesting of these challenges, and describe several alternative approaches for resolving them. The issues raised in this section will frame the discussion in the rest of the paper.

4.1 Unbounded Memory Requirements

Since data streams are potentially unbounded in size, the amount of storage required to compute an exact answer to a data stream query may also grow without bound. While external memory algorithms [91] for handling data sets larger than main memory have been studied, such algorithms are not well suited to data stream applications since they do not support continuous queries and are typically too slow for real-time response. The continuous data stream model is most applicable to problems where timely query responses are important and there are large volumes of data that are being continually produced at a high rate over time. New data is constantly arriving even as the old data is being processed; the amount of computation time per data element must be low, or else the latency of the computation will be too high and the algorithm will not be able to keep pace with the data stream. For this reason, we are interested in algorithms that are able to confine themselves to main memory without accessing disk.

Arasu et al. [7] took some initial steps towards distinguishing between queries that can be answered exactly using a given bounded amount of memory and queries that must be approximated unless disk accesses are allowed. They consider a limited class of queries and, for that class, provide a complete characterization of the queries that require a potentially unbounded amount of memory (proportional to the size of the input data streams) to answer. Their result shows that without knowing the size of the input data streams, it is impossible to place a limit on the memory requirements for most common queries involving joins, unless the domains of the attributes involved in the query are restricted (either based on known characteristics of the data or through the imposition of query predicates). The basic intuition is that without domain restrictions an unbounded number of attribute values must be remembered, because they might turn out to join with tuples that arrive in the future. Extending these results to full generality remains an open research problem.

4.2 Approximate Query Answering

As described in the previous section, when we are limited to a bounded amount of memory it is not always possible to produce exact answers for data stream queries; however, high-quality approximate answers are often acceptable in lieu of exact answers. Approximation algorithms for problems defined over data streams have been a fruitful research area in the algorithms community in recent years, as discussed in detail in Section 6. This work has led to some general techniques for data reduction and synopsis construction, including: sketches [5, 35], random sampling [1, 2, 22], histograms [51, 70], and wavelets [17, 92]. Based on these summarization techniques, we have seen some work on approximate query answering. For example, recent work [27, 37] develops histogram-based techniques to provide approximate answers for correlated aggregate queries over data streams, and Gilbert et al. [40] present a general approach for building small-space summaries over data streams to provide approximate answers for many classes of aggregate queries. However, research problems abound in the area of approximate query answering, with or without streams. Even the basic notion of approximations remains to be investigated in detail for queries involving more than simple aggregation. In the next two subsections, we will touch upon several approaches to approximation, some of which are peculiar to the data stream model of computation.

4.3 Sliding Windows

One technique for producing an approximate answer to a data stream query is to evaluate the query not over the entire past history of the data streams, but rather only over sliding windows of recent data from the streams. For example, only data from the last week could be considered in producing query answers, with data older than one week being discarded.
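As a concrete illustration, the following sketch maintains an aggregate over such a time-based window, discarding expired elements as new ones arrive. It assumes elements arrive in timestamp order and that the window contents fit in memory; both assumptions are revisited below.

    from collections import deque

    class TimeWindowAverage:
        # Maintains the average of a value over the trailing window of
        # `width` seconds (e.g., one week) by evicting elements whose
        # timestamps have fallen out of the window.
        def __init__(self, width):
            self.width = width
            self.window = deque()   # (timestamp, value) in timestamp order
            self.total = 0.0

        def insert(self, ts, value):
            self.window.append((ts, value))
            self.total += value
            while self.window and self.window[0][0] <= ts - self.width:
                _, old = self.window.popleft()
                self.total -= old

        def average(self):
            return self.total / len(self.window) if self.window else None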

Imposing sliding windows on data streams is a natural method for approximation that has several attractive properties. It is well-defined and easily understood: the semantics of the approximation are clear, so that users of the system can be confident that they understand what is given up in producing the approximate answer. It is deterministic, so there is no danger that unfortunate random choices will produce a bad approximation. Most importantly, it emphasizes recent data, which in the majority of real-world applications is more important and relevant than old data: if one is trying in real-time to make sense of network traffic patterns, or phone call or transaction records, or scientific sensor data, then in general insights based on the recent past will be more informative and useful than insights based on stale data. In fact, for many such applications, sliding windows can be thought of not as an approximation technique reluctantly imposed due to the infeasibility of computing over all historical data, but rather as part of the desired query semantics explicitly expressed as part of the user's query. For example, queries Q1 and Q4 from Section 2.2, which tracked traffic on the network backbone, would likely be applied not to all traffic over all time, but rather to traffic in the recent past.

There are a variety of research issues in the use of sliding windows over data streams. To begin with, as we will discuss in Section 5.1, there is the fundamental issue of how we define timestamps over the streams to facilitate the use of windows. Extending SQL or relational algebra to incorporate explicit window specifications is nontrivial, and we also touch upon this topic in Section 5.1. The implementation of sliding window queries and their impact on query optimization is a largely untouched area. In the case where the sliding window is large enough that the entire contents of the window cannot be buffered in memory, there are also theoretical challenges in designing algorithms that can give approximate answers using only the available memory. Some recent results in this vein can be found in [9, 26].

While existing work on sequence and temporal databases has addressed many of the issues involved in time-sensitive queries (a class that includes sliding window queries) in a relational database context [76, 77, 78, 80], differences in the data stream computation model pose new challenges. Research in temporal databases [80] is concerned primarily with maintaining a full history of each data value over time, while in a data stream system we are concerned primarily with processing new data elements on-the-fly. Sequence databases [76, 77, 78] attempt to produce query plans that allow for stream access, meaning that a single scan of the input data is sufficient to evaluate the plan and the amount of memory required for plan evaluation is a constant, independent of the data. This model assumes that the database system has control over which sequence to process tuples from next, e.g., when merging multiple sequences, which we cannot assume in a data stream system.


4.4 Batch Processing, Sampling, and Synopses

Another class of techniques for producing approximate answers is to give up on processing every data element as it arrives, resorting to some sort of sampling or batch processing technique to speed up query execution. We describe a general framework for these techniques. Suppose that a data stream query is answered using a data structure that can be maintained incrementally. The most general description of such a data structure is that it supports two operations, update(tuple) and computeAnswer(). The update operation is invoked to update the data structure as each new data element arrives, and the computeAnswer method produces new or updated results to the query. When processing continuous queries, the best scenario is that both operations are fast relative to the arrival rate of elements in the data streams. In this case, no special techniques are necessary to keep up with the data stream and produce timely answers: as each data element arrives, it is used to update the data structure, and then new results are computed from the data structure, all in less than the average inter-arrival time of the data elements. If one or both of the data structure operations are slow, however, then producing an exact answer that is continually up to date is not possible. We consider the two possible bottlenecks and approaches for dealing with them.
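In code, this framework amounts to the following interface. The operation names update and computeAnswer come from the text; the rest is a sketch of ours, with a trivial running-sum instance of the ideal case where both operations are fast.

    class StreamQueryState:
        # Incrementally-maintained data structure: update() per arriving
        # tuple, computeAnswer() whenever fresh results are wanted.
        def update(self, tuple_):
            raise NotImplementedError

        def computeAnswer(self):
            raise NotImplementedError

    class RunningSum(StreamQueryState):
        # Both operations are O(1), so no batching or sampling is needed.
        def __init__(self):
            self.total = 0

        def update(self, tuple_):
            self.total += tuple_

        def computeAnswer(self):
            return self.total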

Batch Processing

The first scenario is that the update operation is fast but the computeAnswer operation is slow. In this case, the natural solution is to process the data in batches. Rather than producing a continually up-to-date answer, the data elements are buffered as they arrive, and the answer to the query is computed periodically as time permits. The query answer may be considered approximate in the sense that it is not timely, i.e., it represents the exact answer at a point in the recent past rather than the exact answer at the present moment. This approach of approximation through batch processing is attractive because it does not cause any uncertainty about the accuracy of the answer, sacrificing timeliness instead. It is also a good approach when data streams are bursty. An algorithm that cannot keep up with the peak data stream rate may be able to handle the average stream rate quite comfortably by buffering the streams when their rate is high and catching up during the slow periods. This is the approach used in the XJoin algorithm [88].
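A minimal sketch of this batching discipline, reusing the interface above; the five-second period and the print output hook are invented for illustration.

    import time

    def run_batched(stream, state, period=5.0):
        # Apply the cheap update() to every arriving element, but invoke the
        # expensive computeAnswer() only once per `period` seconds, trading
        # timeliness for throughput.
        last = time.monotonic()
        for element in stream:          # `stream`: any iterable of tuples
            state.update(element)
            now = time.monotonic()
            if now - last >= period:
                print(state.computeAnswer())   # assumed output hook
                last = now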

Sampling

In the second scenario, computeAnswer may be fast, but the update operation is slow — it takes longer than the average inter-arrival time of the data elements. It is futile to attempt to make use of all the data when computing an answer, because data arrives faster than it can be processed. Instead, some tuples must be skipped altogether, so that the query is evaluated over a sample of the data stream rather than over the entire data stream. We obtain an approximate answer, but in some cases one can give confidence bounds on the degree of error introduced by the sampling process [48]. Unfortunately, for many situations (including most queries involving joins [20, 22]), sampling-based approaches cannot give reliable approximation guarantees. Designing sampling-based algorithms that can produce approximate answers that are provably close to the exact answer is an important and active area of research.
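One classic way to maintain such a sample is reservoir sampling, which is standard but not prescribed by the paper: it does O(1) work per element and keeps a uniform random sample of fixed size k over a stream of unknown length.

    import random

    def reservoir_sample(stream, k):
        # Element n (0-based) replaces a random slot with probability
        # k/(n+1), which keeps the sample uniform at every prefix.
        sample = []
        for n, element in enumerate(stream):
            if n < k:
                sample.append(element)
            else:
                j = random.randint(0, n)
                if j < k:
                    sample[j] = element
        return sample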

Synopsis Data Structures

Quite obviously, data structures where both the update and the computeAnswer operations are fast are most desirable. For classes of data stream queries where no exact data structure with the desired properties exists, one can often design an approximate data structure that maintains a small synopsis or sketch of the data rather than an exact representation, and therefore is able to keep computation per data element to a minimum. Performing data reduction through synopsis data structures as an alternative to batch processing or sampling is a fruitful research area with particular relevance to the data stream computation model. Synopsis data structures are discussed in more detail in Section 6.
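As one toy example in the spirit of the sketches of [5], the following synopsis estimates the self-join size (the second frequency moment) of a stream in small space. Using Python's built-in hash as a stand-in for a four-wise independent ±1 hash family is a simplification, and the real construction takes medians of means over the copies rather than a plain average.

    import random

    class F2Sketch:
        def __init__(self, copies=64):
            self.seeds = [random.getrandbits(32) for _ in range(copies)]
            self.z = [0] * copies

        def update(self, item):
            # Add a pseudo-random +/-1 per copy; E[z_i^2] equals F2.
            for i, seed in enumerate(self.seeds):
                self.z[i] += 1 if hash((seed, item)) & 1 == 0 else -1

        def computeAnswer(self):
            # Average the independent estimates z_i^2 to reduce variance.
            return sum(z * z for z in self.z) / len(self.z)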

4.5 Blocking Operators

A blocking query operator is a query operator that is unable to produce the first tuple of its output until it has seen its entire input. Sorting is an example of a blocking operator, as are aggregation operators such as SUM, COUNT, MIN, MAX, and AVG. If one thinks about evaluating continuous stream queries using a traditional tree of query operators, where data streams enter at the leaves and final query answers are produced at the root, then the incorporation of blocking operators into the query tree poses problems. Since continuous data streams may be infinite, a blocking operator that has a data stream as one of its inputs will never see its entire input, and therefore it will never be able to produce any output. Clearly, blocking operators are not very suitable to the data stream computation model, but aggregate queries are extremely common, and sorted data is easier to work with and can often be processed more efficiently than unsorted data. Doing away with blocking operators altogether would be problematic, but dealing with them effectively is one of the more challenging aspects of data stream computation.

Blocking operators that are the root of a tree of query operators are more tractable than blocking operators that are interior nodes in the tree, producing intermediate results that are fed to other operators for further processing (for example, the "sort" phase of a sort-merge join, or an aggregate used in a subquery). When we have a blocking aggregation operator at the root of a query tree, if the operator produces a single value or a small number of values, then updates to the answer can be streamed out as they are produced. When the answer is larger, however, such as when the query answer is a relation that is to be produced in sorted order, it is more practical to maintain a data structure with the up-to-date answer, since continually retransmitting the entire answer would be cumbersome. Neither of these two approaches works well for blocking operators that produce intermediate results, however. The central problem is that the results produced by blocking operators may continue to change over time until all the data has been seen, so operators that are consuming those results cannot make reliable decisions based on the results at an intermediate stage of query execution.

One approach to handling blocking operators as interior nodes in a query tree is to replace them with non-blocking analogs that perform approximately the same task. An example of this approach is the juggle operator [72], which is a non-blocking version of sort: it aims to locally reorder a data stream so that tuples that come earlier in the desired sort order are produced before tuples that come later in the sort order, although some tuples may be delivered out of order. An interesting open problem is how to extend this work to other types of blocking operators, as well as to quantify the error that is introduced by approximating blocking operators with non-blocking ones.

Tucker et al. [86] have proposed a different approach to blocking operators. They suggest augmenting data streams with assertions about what can and cannot appear in the remainder of the data stream. These assertions, which are called punctuations, are interleaved with the data elements in the streams. An example of the type of punctuation one might see in a stream with an attribute called daynumber is "for all future tuples, daynumber > 10." Upon seeing this punctuation, an aggregation operator that was grouping by daynumber could stream out its answers for all daynumbers less than 10. Similarly, a join operator could discard all its saved state relating to previously-seen tuples in the joining stream with daynumber < 10, reducing its memory consumption.
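A sketch of how an aggregation operator might exploit punctuations; the representation of punctuations as inline ('PUNCT', d) markers is our own invention, not the proposal of [86].

    class PunctuatedGroupBy:
        # Groups by daynumber and sums len; a punctuation ('PUNCT', d)
        # promises that all future tuples have daynumber > d, so groups at
        # or below d are final: emit them and discard their state.
        def __init__(self):
            self.sums = {}

        def consume(self, item):
            if item[0] == 'TUPLE':
                _, daynumber, length = item
                self.sums[daynumber] = self.sums.get(daynumber, 0) + length
            elif item[0] == 'PUNCT':
                _, d = item
                for day in sorted(k for k in list(self.sums) if k <= d):
                    print('final sum for daynumber', day, '=', self.sums.pop(day))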

An interesting open problem is to formalize the relationship between punctuation and the memory requirements of a query — e.g., a query that might otherwise require unbounded memory could be proved to be answerable in bounded memory if guarantees about the presence of appropriate punctuation are provided. Closely related is the idea of schema-level assertions (constraints) on data streams, which also may help with blocking operators and other aspects of data stream processing. For example, we may know that daynumbers are clustered or strictly increasing, or when joining two streams we may know that a kind of "referential integrity" exists in the arrival of join attribute values. In both cases we may use these constraints to "unblock" operators or reduce memory requirements.

4.6 Queries Referencing Past Data

In the data stream model of computation, once a data element has been streamed by, it cannot be revisited. This limitation means that ad hoc queries that are issued after some data has already been discarded may be impossible to answer accurately. One simple solution to this problem is to stipulate that ad hoc queries are only allowed to reference future data: they are evaluated as though the data streams began at the point when the query was issued, and any past stream elements are ignored (for the purposes of that query). While this solution may not appear very satisfying, it may turn out to be perfectly acceptable for many applications.

A more ambitious approach to handling ad hoc queries that reference past data is to maintain summaries of data streams (in the form of general-purpose synopses or aggregates) that can be used to give approximate answers to future ad hoc queries. Taking this approach requires making a decision in advance about the best way to use memory resources to give good approximate answers to a broad range of possible future queries. The problem is similar in some ways to problems in physical database design such as selection of indexes and materialized views [23]. However, there is an important difference: in a traditional database system, when an index or view is lacking, it is possible to go to the underlying relation, albeit at an increased cost. In the data stream model of computation, if the appropriate summary structure is not present, then no further recourse is available.

Even if ad hoc queries are declared only to pertain to future data, there are still research issues involved in how best to process them. In data stream applications, where most queries are long-lived continuous queries rather than ephemeral one-time queries, the gains that can be achieved by multi-query optimization can be significantly greater than what is possible in traditional database systems. The presence of ad hoc queries transforms the problem of finding the best query plan for a set of queries from an offline problem to an online problem. Ad hoc queries also raise the issue of adaptivity in query plans. The Eddy query execution framework [8] introduces the notion of flexible query plans that adapt to changes in data arrival rates or other data characteristics over time. Extending this idea to adapt the joint plan for a set of continuous queries as new queries are added and old ones are removed remains an open research area.

5 Proposal for a DSMS

At Stanford we have begun the design and prototype implementation of a comprehensive DSMS called STREAM (for STanford StREam DatA Manager) [82]. As discussed in earlier sections, in a DSMS traditional one-time queries are replaced or augmented with continuous queries, and techniques such as sliding windows, synopsis structures, approximate answers, and adaptive query processing become fundamental features of the system. Other aspects of a complete DBMS also need to be reconsidered, including query languages, storage and buffer management, user and application interfaces, and transaction support. In this section we will focus primarily on the query language and query processing components of a DSMS and only touch upon other issues based on our initial experiences.

5.1 Query Language for a DSMS

Any general-purpose data management system must have a flexible and intuitive method by which the users of the system can express their queries. In the STREAM project, we have chosen to use a modified version of SQL as the query interface to the system (although we are also providing a means to submit query plans directly). SQL is a well-known language with a large user population. It is also a declarative language that gives the system flexibility in selecting the optimal evaluation procedure to produce the desired answer. Other methods for receiving queries from users are possible; for example, the Aurora system described in [16] uses a graphical "boxes and arrows" interface for specifying data flow through the system. This interface is intuitive and gives the user more control over the exact series of steps by which the query answer is obtained than is provided by a declarative query language.

The main modification that we have made to standard SQL, in addition to allowing the FROM clause to refer to streams as well as relations, is to extend the expressiveness of the query language for sliding windows. It is possible to formulate sliding window queries in SQL by referring to timestamps explicitly, but it is often quite awkward. SQL-99 [14, 81] introduces analytical functions that partially address the shortcomings of SQL for expressing sliding window queries by allowing the specification of moving averages and other aggregation operations over sliding windows. However, the SQL-99 syntax is not sufficiently expressive for data stream queries since it cannot be applied to non-aggregation operations such as joins.

The notion of sliding windows requires at least an ordering on data stream elements. In many cases, the arrival order of the elements suffices as an "implicit timestamp" attached to each data element; however, sometimes it is preferable to use "explicit timestamps" provided as part of the data stream. Formally we say (following [16]) that a data stream S consists of a set of (tuple, timestamp) pairs: S = {⟨s1, i1⟩, ⟨s2, i2⟩, ..., ⟨sn, in⟩}. The timestamp attribute could be a traditional timestamp or it could be a sequence number — all that is required is that it comes from a totally ordered domain with a distance metric. The ordering induced by the timestamps is used when selecting the data elements making up a sliding window.

We extend SQL by allowing an optional window specification to be provided, enclosed in brackets, after a stream (or subquery producing a stream) that is supplied in a query's FROM clause. A window specification consists of:

1. an optional partitioning clause, which partitions the data into several groups and maintains a separate window for each group,

2. a window size, either in "physical" units (i.e., the number of data elements in the window) or in "logical" units (i.e., the range of time covered by a window, such as 30 days), and

3. an optional filtering predicate.

As in SQL-99, physical windows are specified using the ROWS keyword (e.g., ROWS 50 PRECEDING), while logical windows are specified via the RANGE keyword (e.g., RANGE 15 MINUTES PRECEDING). In lieu of a formal grammar, we present several examples to illustrate our language extension.

The underlying source of data for our examples will be a stream of telephone call records, each with four attributes: customerid, type, minutes, and timestamp. The timestamp attribute is the ordering attribute for the records. Suppose a user wanted to compute the average call length, but considering only the ten most recent long-distance calls placed by each customer. The query can be formulated as follows:

SELECT AVG(S.minutes)
FROM Calls S [PARTITION BY S.customerid
              ROWS 10 PRECEDING
              WHERE S.type = 'Long Distance']

where the expression in brackets defines a sliding window on the stream of calls.

Contrast the previous query to a similar one that computes the average call length considering only long-distance calls that are among the last 10 calls of all types placed by each customer:


SELECT AVG(S.minutes)
FROM Calls S [PARTITION BY S.customerid
              ROWS 10 PRECEDING]
WHERE S.type = 'Long Distance'

The distinction between filtering predicates applied before calculating the sliding window cutoffs and predicates applied after windowing motivates our inclusion of an optional WHERE clause within the window specification.

Here is a slightly more complicated example returning the average length of the last 1000 telephone calls placed by "Gold" customers:

SELECT AVG(V.minutes)
FROM (SELECT S.minutes
      FROM Calls S, Customers T
      WHERE S.customerid = T.customerid
        AND T.tier = 'Gold') V [ROWS 1000 PRECEDING]

Notice that in this example, the stream of calls must be joined to the Customers relation before applying the sliding window.

5.2 Timestamps in Streams

In the previous section, sliding windows are defined with respect to a timestamp or sequence number attribute representing a tuple's arrival time. This approach is unambiguous for tuples that come from a single stream, but it is less clear what is meant when attempting to apply sliding windows to composite tuples that are derived from tuples from multiple underlying streams (e.g., windows on the output of a join operator). What should the timestamp of a tuple in the join result be when the timestamps of the tuples that were joined to form the result tuple are different? Timestamp issues also arise when a set of distributed streams makes up a single logical stream, as in the web monitoring application described in Section 2.2, or in truly distributed streams such as sensor networks, where comparing timestamps across stream elements may be relevant.

In the previous section we introduced implicit timestamps, in which the system adds a special field to each incoming tuple, and explicit timestamps, in which a data attribute is designated as the timestamp. Explicit timestamps are used when each tuple corresponds to a real-world event at a particular time that is of importance to the meaning of the tuple. Implicit timestamps are used when the data source does not already include timestamp information, or when the exact moment in time associated with a tuple is not important, but general considerations such as "recent" or "old" may be important. The distinction between implicit and explicit timestamps is similar to that between transaction and valid time in the temporal database literature [80].

Explicit timestamps have the drawback that tuples may not arrive in the same order as their timestamps — tuples with later timestamps may come before tuples with earlier timestamps. This lack of guaranteed ordering makes it difficult to perform sliding window computations that are defined in relation to explicit timestamps, or any other processing based on order. However, as long as an input stream is "almost-sorted" by timestamp, except for local perturbations, then out-of-order tuples can easily be corrected with little buffering. It seems reasonable to assume that even when explicit timestamps are used, tuples will be delivered in roughly increasing timestamp order.
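A small sketch of such a reordering buffer; the fixed slack of 100 tuples is an assumption, and a real system might bound the buffer by time instead.

    import heapq

    class ReorderBuffer:
        # Corrects local perturbations in an almost-sorted stream: holds up
        # to `slack` tuples in a min-heap on the explicit timestamp and
        # releases the smallest once full. A tuple arriving more than
        # `slack` positions late would still be emitted out of order.
        def __init__(self, slack=100):
            self.slack = slack
            self.heap = []
            self._seq = 0   # tie-breaker so the heap never compares tuples

        def insert(self, ts, tup):
            heapq.heappush(self.heap, (ts, self._seq, tup))
            self._seq += 1
            if len(self.heap) > self.slack:
                ts_out, _, out = heapq.heappop(self.heap)
                return ts_out, out       # emitted in timestamp order
            return None

        def drain(self):
            while self.heap:
                ts_out, _, out = heapq.heappop(self.heap)
                yield ts_out, out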

Let us now look at how to assign appropriate timestamps to tuples output by binary operators, using join as an example. There are several possible approaches that could be taken; we discuss two. The first approach, which fits better with implicit timestamps, is to provide no guarantees about the output order of tuples from a join operator, but to simply assume that tuples that arrive earlier are likely to pass through the join earlier; exact ordering may depend on implementation and scheduling vagaries. Each tuple that is produced by a join operator is assigned an implicit timestamp that is set to the time that it was produced by the join operator. This "best-effort" approach has the advantage that it maximizes implementation flexibility; it has the disadvantage that it makes it impossible to impose precisely-defined, deterministic sliding-window semantics on the results of subqueries.

The second approach, which fits with either explicit or implicit timestamps, is to have the user specify as part of the query what timestamp is to be assigned to tuples resulting from the join of multiple streams. One simple policy is that the order in which the streams are listed in the FROM clause of the query represents a prioritization of the streams. The timestamp for a tuple output by a join should be the timestamp of the joining tuple from the input stream listed first in the FROM clause. This approach can result in multiple tuples with the same timestamp; for the purpose of ordering the results, ties can be broken using the timestamp of the other input stream. For example, if the query is

SELECT *
FROM S1 [ROWS 1000 PRECEDING],
     S2 [ROWS 100 PRECEDING]
WHERE S1.A = S2.B

then the output tuples would first be sorted by the timestamp of S1, and then ties would be broken according to the timestamp of S2.

The second, stricter approach to assigning timestamps to the results of binary operators can have a drawback from an implementation point of view. If it is desirable for the output from a join to be sorted by timestamp, the join operator will need to buffer output tuples until it can be certain that future input tuples will not disrupt the ordering of output tuples. For example, if S1's timestamp has priority over S2's and a recent S1 tuple joins with an S2 tuple, it is possible that a future S2 tuple will join with an older S1 tuple that still falls within the current window. In that case, the join tuple that was produced second belongs before the join tuple that was produced first. In a query tree consisting of multiple joins, the extra latency introduced for this reason could propagate up the tree in an additive fashion. If the inputs to the join operator did not have sliding windows at all, then the join operator could never confidently produce outputs in sorted order.

As mentioned earlier, sliding windows have two distinct purposes: sometimes they are an important part of the query semantics, and other times they are an approximation scheme to improve query efficiency and reduce data volumes to a manageable size. When the sliding window serves mostly to increase query processing efficiency, then the best-effort approach, which allows wide latitude over the ordering of tuples, is usually acceptable. On the other hand, when the ordering of tuples plays a significant role in the meaning of the query, such as for query-defined sliding windows, then the stricter approach may be preferred, even at the cost of less efficient implementation. A general-purpose data stream processing system should support both types of sliding windows, and the query language should allow users to specify one or the other.

In our system, we add an extra keyword, RECENT, that replaces PRECEDING when a "best-effort" ordering may be used. For example, the clause ROWS 10 PRECEDING specifies a window consisting of the previous 10 tuples, strictly sorted by timestamp order. By comparison, ROWS 10 RECENT also specifies a sliding window consisting of 10 records, but the DSMS is allowed to use its own ordering to produce the sliding window, rather than being constrained to follow the timestamp ordering. The RECENT keyword is only used with "physical" window sizes specified as a number of records; "logical" windows such as RANGE 3 DAYS PRECEDING must use the PRECEDING keyword.


[Figure 1 is a diagram: a binary operator O1 and a unary operator O2 connected by a queue q3, with input queues q1 and q2 feeding O1, and synopses Syn1 and Syn2 maintained by O1.]

Figure 1: A portion of a query plan in our DSMS.

5.3 Query Processing Architecture of a DSMS

In this section, we describe the query processing architecture of our DSMS. So far we have focused on continuous queries only. When a query is registered, a query execution plan is produced that begins executing and continues indefinitely. We have not yet addressed ad hoc queries registered after relevant streams have begun (Section 4.6).

Query execution plans in our system consist of operators connected by queues. Operators can maintain intermediate state in synopsis data structures. A portion of an example query plan is shown in Figure 1, with one binary operator (O1) and one unary operator (O2). The two operators are connected by a queue q3, and operator O1 has two input queues, q1 and q2. Also shown in Figure 1 are two synopsis structures used by operator O1, Syn1 and Syn2, one per input. For example, O1 could be a sliding window join operator, which maintains a sliding window synopsis for each join input (Section 4.3). The system memory is partitioned dynamically among the synopses and queues in query plans, along with the buffers used for handling streams coming over the network and a cache for disk-resident data. Note that both Aurora [16] and Eddies [8] use a single globally-shared queue for inter-operator data flow instead of separate queues between operators as in Figure 1.

Operators in our system are scheduled for execution by a central scheduler. During execution, an operator reads data from its input queues, updates the synopsis structures that it maintains, and writes results to its output queues. (Our operators thus adhere to the update and computeAnswer model discussed in Section 4.4.) The period of execution of an operator is determined dynamically by the scheduler, and the operator returns control back to the scheduler once its period expires. We are experimenting with different policies for scheduling operators and for determining the period of execution. The period of execution may be based on time, or it may be based on other quantities, such as the number of tuples consumed or produced. Both Aurora and Eddies have chosen to perform fine-grained scheduling where, in each step, the scheduler chooses a tuple from the global queue and passes it to an operator for processing, an approach that our scheduler could choose if appropriate.
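The following skeleton (illustrative names only, not our system's actual API) shows the shape of such an operator: the scheduler grants a period, here measured in tuples, and the operator folds input into its synopses and emits results until the period expires.

    from collections import deque

    class Operator:
        """Skeleton of an operator in the update/computeAnswer style.
        Inputs and the output are queues (deques); synopses live in
        subclass state."""

        def __init__(self, input_queues, output_queue):
            self.inputs = input_queues
            self.output = output_queue

        def run(self, period):
            """Execute until the scheduler-assigned period (a tuple budget
            here, though it could equally be a time slice) expires, then
            return control to the central scheduler."""
            processed = 0
            while processed < period and any(self.inputs):
                for q in self.inputs:
                    if q:
                        tup = q.popleft()
                        self.update(tup)                    # fold into synopses
                        self.output.extend(self.compute_answer())
                        processed += 1
            return processed

        def update(self, tup):
            """Fold the tuple into this operator's synopses (subclass hook)."""

        def compute_answer(self):
            """Emit any new result tuples (subclass hook)."""
            return []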

We expect continuous queries and the data streams on which they operate to be long-running. During the lifetime of a continuous query, parameters such as stream data characteristics, stream flow rates, and the number of concurrently running queries may vary considerably. To handle these fluctuations, all of our operators are adaptive. So far we have focused primarily on adaptivity to available memory, although other factors could be considered, including using disk to increase temporary storage at the expense of latency.


Our approach to memory adaptivity is basically one of trading accuracy for memory. Specifically, each operator maximizes the accuracy of its output based on the size of its available memory, and handles dynamic changes in the size of its available memory gracefully, since at run-time memory may be taken away from one operator and given to another. As a simple example, a sliding window join operator as discussed above may be used as an approximation to a join over the entire history of input streams. If so, the larger the windows (stored in available memory), the better the approximation. Other examples include duplicate elimination using limited-size hash tables, and sampling using reservoirs [90]. The Aurora system [16] also proposes adaptivity and approximations, and uses load-shedding techniques based on application-specified measures of quality of service for graceful degradation in the face of system overload.
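As a small illustration of this accuracy-for-memory trade (a sketch under assumed names, not our actual operator code), consider a windowed join whose per-input window shrinks whenever the memory manager lowers its budget:

    from collections import deque

    class AdaptiveWindowJoin:
        """The join keeps at most `budget` tuples per input; shrinking the
        budget degrades the approximation (fewer matches found) but frees
        memory immediately."""

        def __init__(self, budget):
            self.budget = budget
            self.windows = (deque(), deque())   # one window synopsis per input

        def set_budget(self, budget):
            """Memory-manager callback: reclaim memory by dropping the
            oldest tuples first."""
            self.budget = budget
            for w in self.windows:
                while len(w) > budget:
                    w.popleft()

        def insert(self, side, key, tup):
            """Insert a tuple on input `side` (0 or 1); return join results."""
            mine, other = self.windows[side], self.windows[1 - side]
            results = [(tup, t2) for k2, t2 in other if k2 == key]
            mine.append((key, tup))
            if len(mine) > self.budget:
                mine.popleft()
            return results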

Our fundamental approach of trading accuracy for memory brings up some interesting problems:

- We first need to understand how different query operators can produce approximate answers under limited memory, and how approximate results behave when operators are composed in query plans.

- Given a query plan as a tree of operators and a certain amount of memory, how can the DSMS allocate memory to the operators to maximize the accuracy of the answer to the query (i.e., minimize approximation)?

- Under changing conditions, how can the DSMS reallocate memory among operators?

- Suppose we are given a query rather than a query plan. How does the query optimizer efficiently find the plan that, with the best memory allocation, minimizes approximation? Should plans be modified when conditions change?

- Even further, since synopses could be shared among query plans [75], how do we optimally consider a set of queries, which may be weighted by importance?

In addition to memory management, we are faced with the problem of scheduling multiple query plans in a DSMS. The scheduler needs to provide rate synchronization within operators (such as stream joins) and also across pipelined operators in query plans [8, 89]. Time-varying arrival rates of data streams and time-varying output rates of operators add to the complexity of scheduling. Scheduling decisions also need to take into account memory allocation across operators, including management of buffers for incoming streams, availability of synopses on disk as opposed to in memory, and the performance requirements of individual queries.

Aside from the query processing architecture, user and application interfaces need to be reinvestigated in a DSMS given the dynamic environment in which it operates. Systems such as Aurora [16] and Hancock [25] completely eliminate declarative querying and provide only procedural mechanisms for querying. In contrast, we will provide a declarative language for continuous queries, similar to SQL but extended with operators such as those discussed in Section 5.1, as well as a mechanism for directly submitting plans in the query algebra that underlies our language.

We are developing a comprehensive DSMS interface that allows users and administrators to visually monitor the execution of continuous queries, including memory usage and approximation behavior. We will also provide a way for administrators to adjust system parameters as queries are running, including memory allocation and scheduling policies.

6 Algorithmic Issues

The algorithms community has been fairly active of late in the area of data streams, typically motivated by problems in databases and networking. The model of computation underlying the algorithmic work is


similar to that in Section 2 and can be formally stated as follows: A data stream algorithm takes as input a sequence of data items $x_1, \ldots, x_n, \ldots$ called the data stream, where the sequence is scanned only once in the increasing order of the indexes. The algorithm is required to maintain the value of a function $f$ on the prefix of the stream seen so far.

The main complexity measure is the space used by the algorithm, although the time required to process each stream element is also relevant. In some cases, the algorithm maintains a data structure which can be used to compute the value of the function $f$ on demand, and then the time required to process each such query also becomes of interest. Henzinger et al. [49] defined a similar model but also allowed the algorithm to make multiple passes over the stream data, making the number of passes itself a complexity measure. We will restrict our attention to algorithms which are allowed only one pass.

We will measure space and time in terms of the parameter $n$, which denotes the number of stream elements seen so far. The primary performance goal is to ensure that the space required by a stream algorithm is "small." Ideally, one would want the memory bound to be independent of $n$ (which is unbounded). However, for most interesting problems it is easy to prove a space lower bound that precludes this possibility, thereby forcing us to settle for bounds that are merely sublinear in $n$. A problem is considered to be "well-solved" if one can devise an algorithm which requires at most $O(\mathrm{poly}(\log n))$ space and $O(\mathrm{poly}(\log n))$ processing time per data element or query¹. We will see that in some cases it is impossible to achieve such an algorithm, even if one is willing to settle for approximations.

The rest of this section summarizes the state of the art for data stream algorithms, at least as relevant to databases. We will focus primarily on the problems of creating summary structures (synopses) for a data stream, such as histograms, wavelet representations, clustering, and decision trees; in addition, we will also touch upon known lower bounds for space and time requirements of data stream algorithms. Most of these summary structures have been considered for traditional databases [13]. The challenge is to adapt some of these techniques to the data stream model.

6.1 Random Samples

Random samples can be used as a summary structure in many scenarios where a small sample is expected to capture the essential characteristics of the data set [65]. It is perhaps the easiest form of summarization in a DSMS, and other synopses can be built from a sample itself. In fact, the join synopsis in the AQUA system [2] is nothing but a uniform sample of the base relation. Recently, stratified sampling has been proposed as an alternative to uniform sampling to reduce error due to the variance in data and also to reduce error for group-by queries [1, 19]. Actually computing a random sample over a data stream is relatively easy: the reservoir sampling algorithm of Vitter [90] makes one pass over the data set and is well suited to the data stream model. There is also an extension by Chaudhuri, Motwani and Narasayya [22] to the case of weighted sampling.
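For reference, the algorithm fits in a few lines; the sketch below (Algorithm R, after Vitter [90]) maintains a uniform sample of $k$ items in one pass over a stream of unknown length:

    import random

    def reservoir_sample(stream, k, rng=random):
        """One-pass uniform sample of k items from a stream."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)     # fill the reservoir first
            else:
                j = rng.randint(0, i)      # keep new item with prob. k/(i+1)
                if j < k:
                    reservoir[j] = item    # evict a uniformly chosen old item
        return reservoir

    print(reservoir_sample(range(10**6), k=5))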

6.2 Sketching Techniques

In their seminal paper, Alon, Matias and Szegedy [5] introduced the notion of randomized sketching, which has been widely used ever since. Sketching involves building a summary of a data stream using a small amount of memory, with which it is possible to estimate the answer to certain queries (typically, "distance" queries) over the data set.

Let $A = (a_1, \ldots, a_n)$ be a sequence of elements where each $a_j$ belongs to the domain $D = \{1, \ldots, d\}$. Let the multiplicity $m_i = |\{j \mid a_j = i\}|$ denote the number of occurrences of value $i$ in the sequence $A$. For $k \geq 0$, the $k$th frequency moment $F_k$ of $A$ is defined as $F_k = \sum_{i=1}^{d} m_i^k$; further, we define $F_\infty^* = \max_i m_i$.

¹We use $\mathrm{poly}(\cdot)$ to denote a polynomial function.


The frequency moments capture the statistics of the distribution of values in $A$ — for instance, $F_0$ is the number of distinct values in the sequence, $F_1$ is the length of the sequence, $F_2$ is the self-join size (also called Gini's index of homogeneity), and $F_\infty^*$ is the most frequent item's multiplicity. It is not very difficult to see that an exact computation of these moments requires linear space, and so we focus our attention on approximations.

The problem of efficiently estimating the number of distinct values ($F_0$) has received particular attention in the database literature, particularly in the context of using single-pass or random sampling algorithms [18, 46]. A sketching technique to compute $F_0$ was presented earlier by Flajolet and Martin [35]; however, this had the drawback of requiring explicit families of hash functions with very strong independence properties. This requirement was relaxed by Alon, Matias, and Szegedy [5], who presented a sketching technique to estimate $F_0$ within a constant factor². Their technique uses linear hash functions and requires only $O(\log d)$ memory. The key contribution of Alon et al. [5] was a sketching technique to estimate $F_2$ that uses only $O(\log d + \log n)$ space and provides arbitrarily small approximation factors. This technique has found many applications in the database literature, including join size estimation [4], estimating the $L_1$ norm of vectors [33], and processing complex aggregate queries over multiple streams [27, 37]. It remains an open problem to come up with techniques to maintain correlated aggregates [37] that have provable guarantees.

The key idea behind the $F_2$-sketching technique can be described as follows: Every element $i$ in the domain $D$ is hashed uniformly at random onto a value $\xi_i \in \{-1, +1\}$. Define the random variable $X = \sum_i m_i \xi_i$ and return $X^2$ as the estimator of $F_2$. Observe that the estimator can be computed in a single pass over the data provided we can efficiently compute the $\xi_i$ values. If the hash functions have four-way independence³, it is easy to prove that the quantity $X^2$ has expectation equal to $F_2$ and variance less than $2F_2^2$. Using standard tricks, we can combine several independent estimators to accurately and with high probability obtain an estimate of $F_2$. At an intuitive level, we can view this technique as a tug-of-war where elements are randomly assigned to one of the two sides of the rope based on the value $i$; the square of the displacement of the rope captures the skew $F_2$ in the data.
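A minimal sketch of the estimator follows. For clarity it stores a truly random sign for each domain value in a dictionary; the actual technique uses four-wise independent hash functions precisely so that these signs need not be stored explicitly:

    import random
    from statistics import median

    def f2_estimate(stream, groups=5, per_group=10, seed=0):
        """Tug-of-war estimator for F2, in the spirit of [5]."""
        total = groups * per_group
        rng = random.Random(seed)
        signs = [{} for _ in range(total)]   # lazily chosen xi_i in {-1, +1}
        X = [0] * total
        for item in stream:
            for r in range(total):
                if item not in signs[r]:
                    signs[r][item] = rng.choice((-1, 1))
                X[r] += signs[r][item]       # X = sum_i m_i * xi_i
        # Mean within each group reduces variance; median across groups
        # boosts confidence (the standard median-of-means trick).
        means = [sum(x * x for x in X[g * per_group:(g + 1) * per_group]) / per_group
                 for g in range(groups)]
        return median(means)

    # F2 of a stream with m_a = 200, m_b = 100 is 200^2 + 100^2 = 50000
    print(f2_estimate(["a", "a", "b"] * 100))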

Observe that computing the self-join size of a relation is exactly the same as computing $F_2$ for the values of the join attribute in the relation. Alon et al. [4] extended this technique to estimating the join size between two distinct relations $R$ and $S$, as follows. Let $X$ and $Y$ be random variables corresponding to $R$ and $S$, respectively, defined like the tug-of-war variable above; the mapping from domain values $i$ to $\xi_i$ is the same for both relations. Then, it can be proved that the estimator $XY$ has expected value equal to $|R \bowtie S|$ and variance less than $2 \cdot |R \bowtie R| \cdot |S \bowtie S|$. In order to get small relative error, we can use $O\!\left(\frac{|R \bowtie R| \cdot |S \bowtie S|}{|R \bowtie S|^2}\right)$ independent estimators. Observe that for estimating joins between two relations, the number of estimators depends on the data distribution. In a recent paper, Dobra et al. [27] extended this technique to estimate the size of multi-way joins and to answer complex aggregate queries over them. They also provide techniques to optimally partition the data domain and use estimators on each partition independently, so as to minimize the total memory requirement.

The frequency moment $F_2$ can also be viewed as the square of the $L_2$ norm of a vector whose value along the $i$th dimension is the multiplicity $m_i$. Thus, the above technique can be used to compute the $L_2$ norm under an update model for vectors, where each data element $(u, i)$ increments (or decrements) some $m_i$ by a quantity $u$. On seeing such an update, we update the corresponding sketch by adding $u \cdot \xi_i$ to the sum. The sketching idea can also be extended to compute the $L_1$ norm of a vector, as follows. Let us assume that each dimension of the underlying vector is an integer bounded by $M$. Consider the unary representation of the vector. It has $M \cdot d$ bit positions (elements), where $d$ is the dimension of the underlying vector. A 1 in the unary

²As discussed in Section 6.7, recently Bar-Yossef et al. [12] and Gibbons and Tirthapura [38] have devised algorithms which, under certain conditions, provide arbitrarily small approximation factors without recourse to perfect hash functions.

³Hash functions with four-way independence can be obtained using standard techniques involving the use of parity check matrices of BCH codes [65].


representation denotes that the element corresponding to the bit position is present once; otherwise, it is not present. Then $F_2$ captures the $L_1$ norm of the vector. The catch is that, given an element $v_i$ along dimension $i$, it is required that we can efficiently compute the range sum $\sum_{j=1}^{v_i} \xi_{i,j}$ of the hash values corresponding to the pertinent bit positions that are set to 1. Feigenbaum et al. [33] showed how to construct such a family of range-summable $\pm 1$-valued hash functions with limited (four-way) independence. Indyk [50] provided a uniform framework to compute the $L_p$ norm (for $p \in (0, 2]$) using the so-called $p$-stable distributions, improving upon the previous paper [33] for estimating the $L_1$ norm, in that it allowed for arbitrary addition and deletion updates in every dimension. The ability to efficiently compute the $L_1$ and $L_2$ norms of the difference of two vectors is central to some synopsis structures designed for data streams.

6.3 Histograms

Histograms are commonly-used summary structures to succinctly capture the distribution of values in a data set (i.e., a column, or possibly even a collection of columns, in a table). They have been employed for a multitude of tasks such as query size estimation, approximate query answering, and data mining. We consider the summarization of data streams using histograms. There are several different types of histograms that have been proposed in the literature. Some popular definitions are:

- V-Optimal Histograms: These approximate the distribution of a set of values $v_1, \ldots, v_n$ by a piecewise-constant function $\hat{v}(i)$, so as to minimize the sum of squared error $\sum_i (v_i - \hat{v}(i))^2$.

- Equi-Width Histograms: These partition the domain into buckets such that the number of $v_i$ values falling into each bucket is uniform across all buckets. In other words, they maintain quantiles for the underlying data distribution as the bucket boundaries.

- End-Biased Histograms: These maintain exact counts of items that occur with frequency above a threshold, and approximate the other counts by a uniform distribution. Maintaining the counts of such frequent items is related to iceberg queries [32].

We give an overview of recent work on computing such histograms over data streams.

V-Optimal Histograms over Data Streams

Jagadish et al. [54] showed how to compute optimal V-Optimal Histograms for a given data set using dynamic programming. The algorithm uses $O(nk)$ space and requires $O(n^2 k)$ time, where $n$ is the size of the data set and $k$ is the number of buckets. This is prohibitive for data streams. Guha, Koudas and Shim [43] adapted this algorithm to sorted data streams. Their algorithm constructs an arbitrarily-close V-Optimal Histogram (i.e., with error arbitrarily close to that of the optimal histogram), using $O(k^2 \log n)$ space and $O(k^2 \log n)$ time per data element.

In a recent paper, Gilbert et al. [39] removed the restriction that the data stream be sorted, providing algorithms based on the sketching technique described earlier for computing $L_2$ norms. The idea is to view each data element as an update to an underlying vector of length $n$ that we are trying to approximate using the best $k$-bucket histogram. The time to process a data element, the time to reconstruct the histogram, and the size of the sketch are each bounded by $\mathrm{poly}(k, \log n, 1/\epsilon)$, where $\epsilon$ is the relative error we are willing to tolerate. Their algorithm proceeds by first constructing a robust approximation to the underlying "signal." A robust approximation is built by repeatedly adding a dyadic interval of constant value⁴ which best reduces the approximation error. In order to find such a dyadic interval it is necessary to efficiently compute the sketch of the original signal minus the constant dyadic interval⁵. This translates to efficiently computing the range sum of $p$-stable random variables (used for computing the $L_2$ sketch, see Indyk [50]) over the dyadic interval. Gilbert et al. [39] show how to construct such efficiently range-summable $p$-stable random variables. From the robust histogram they cull a histogram of desired accuracy and with $k$ buckets.

⁴A signal that corresponds to a constant value over the dyadic interval and zero everywhere else.

⁵That is, a sketch for the $L_2$ norm of the difference between the original signal and the dyadic interval with constant value.
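For reference, the offline dynamic program of Jagadish et al. [54] is easy to state; the sketch below (assuming $n \geq k$) computes the minimum total squared error of a $k$-bucket histogram, and bucket boundaries can be recovered by also recording the minimizing split points:

    def v_optimal_histogram(values, k):
        """O(n^2 k)-time DP for the V-Optimal histogram error."""
        n = len(values)
        # Prefix sums let us get the SSE of any interval in O(1).
        p = [0.0] * (n + 1)    # prefix sums of values
        pp = [0.0] * (n + 1)   # prefix sums of squared values
        for i, v in enumerate(values):
            p[i + 1] = p[i] + v
            pp[i + 1] = pp[i] + v * v

        def sse(i, j):  # squared error of one bucket covering values[i:j]
            s, ss, m = p[j] - p[i], pp[j] - pp[i], j - i
            return ss - s * s / m

        INF = float("inf")
        # err[j][i] = best error for the first i values using j buckets
        err = [[INF] * (n + 1) for _ in range(k + 1)]
        err[0][0] = 0.0
        for j in range(1, k + 1):
            for i in range(j, n + 1):
                err[j][i] = min(err[j - 1][l] + sse(l, i)
                                for l in range(j - 1, i))
        return err[k][n]

    print(v_optimal_histogram([1, 1, 1, 5, 5, 5, 9, 9], k=3))  # 0.0: three flat runs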

Equi-Width Histograms and Quantiles

Equi-width histograms, based on quantiles, are summary structures which characterize data distributions in a manner that is less sensitive to outliers. In traditional databases they are used by optimizers for selectivity estimation. Parallel database systems employ value-range data partitioning that requires generation of quantiles or splitters that partition the data into approximately equal parts. Recently, Greenwald and Khanna [41] presented a single-pass deterministic algorithm for efficient computation of quantiles. Their algorithm needs $O(\frac{1}{\epsilon} \log(\epsilon n))$ space and guarantees a precision of $\epsilon n$. They employ a novel data structure that maintains a sample of the values seen so far (quantiles), along with a range of possible ranks that the samples can take. The error associated with each quantile is the width of this range. They periodically merge quantiles with "similar" errors so long as the error for the combined quantile does not exceed $\epsilon n$. This algorithm improves upon the previous set of results by Manku, Rajagopalan, and Lindsay [61, 62] and Chaudhuri, Motwani, and Narasayya [21].

End-Biased Histograms and Iceberg Queries

Many applications maintain simple aggregates (counts) over an attribute to find aggregate values above a specified threshold. These queries are referred to as iceberg queries [32]. Such iceberg queries arise in many applications, including data mining, data warehousing, information retrieval, market basket analysis, copy detection, and clustering. For example, a search engine might be interested in gathering search terms that account for more than 1% of the queries. Such frequent item summaries are useful for applications such as caching and analyzing trends. Fang et al. [32] gave an efficient algorithm to compute iceberg queries over disk-resident data. Their algorithm requires multiple passes and so is not suited to the streaming model. In a recent paper, Manku and Motwani [60] presented randomized and deterministic algorithms for frequency counting and iceberg queries over data streams. The randomized algorithm uses adaptive sampling; the main idea is that any item which accounts for an $\epsilon$ fraction of the items is highly likely to be part of a uniform sample of size $O(\frac{1}{\epsilon})$. The deterministic algorithm maintains a sample of the distinct items along with their frequency. Whenever a new item is added, it is given the benefit of the doubt by over-estimating its frequency. If we see an item that already exists in the sample, its frequency is incremented. Periodically, items with low frequency are deleted. Their algorithms require $O(\frac{1}{\epsilon} \log(\epsilon n))$ space, where $n$ is the length of the data stream, and guarantee that any element is undercounted by at most $\epsilon n$. Thus, these algorithms report all items of count greater than $\epsilon n$. Moreover, for all items reported, they guarantee that the reported count is less than the actual count, but by no more than $\epsilon n$.
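A minimal sketch of the deterministic algorithm (commonly known as lossy counting) is shown below; the bucket width and pruning rule follow the description above, and reported counts undercount the truth by at most $\epsilon n$:

    import math

    def lossy_counting(stream, epsilon):
        """Maintain approximate frequencies; items with true count
        > epsilon * n are guaranteed to survive to the end."""
        width = math.ceil(1.0 / epsilon)       # bucket width
        counts = {}                            # item -> (freq, max_undercount)
        for n, item in enumerate(stream, start=1):
            bucket = math.ceil(n / width)      # current bucket id
            if item in counts:
                f, delta = counts[item]
                counts[item] = (f + 1, delta)
            else:
                counts[item] = (1, bucket - 1) # benefit of the doubt
            if n % width == 0:                 # end of bucket: prune low counts
                counts = {x: (f, d) for x, (f, d) in counts.items()
                          if f + d > bucket}
        return counts

    print(lossy_counting(["a"] * 60 + ["b"] * 5 + ["c"] * 35, epsilon=0.1))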

6.4 Wavelets

Wavelets are often used as a technique to provide a summary representation of the data. Wavelet coefficients are projections of the given signal (the set of data values) onto an orthogonal set of basis vectors. The choice of basis vectors determines the type of wavelets. Often Haar wavelets are used in databases for their ease of computation. Wavelet coefficients have the desirable property that the signal reconstructed from the top few wavelet coefficients best approximates the original signal in terms of the $L_2$ norm.
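The property is easy to see in code: with an orthonormal Haar decomposition, retaining the $k$ largest-magnitude coefficients minimizes the $L_2$ reconstruction error. A small sketch (offline, for intuition; streaming maintenance is discussed next):

    import heapq
    import math

    def haar_coefficients(signal):
        """Orthonormal Haar decomposition of a length-2^k signal."""
        s, details = list(signal), []
        r2 = math.sqrt(2.0)
        while len(s) > 1:
            avg = [(s[i] + s[i + 1]) / r2 for i in range(0, len(s), 2)]
            det = [(s[i] - s[i + 1]) / r2 for i in range(0, len(s), 2)]
            details = det + details   # finest-level details go last
            s = avg
        return s + details            # overall coefficient first

    def best_k_term(coeffs, k):
        """Keep the k largest-magnitude coefficients, zero out the rest."""
        keep = set(heapq.nlargest(k, range(len(coeffs)),
                                  key=lambda i: abs(coeffs[i])))
        return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]

    c = haar_coefficients([2.0, 2.0, 4.0, 4.0, 8.0, 8.0, 6.0, 6.0])
    print(best_k_term(c, k=3))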



Recent papers have demonstrated the efficacy of wavelets for different tasks such as selectivity estimation [63], data cube approximation [93], and computing multi-dimensional aggregates [92]. This body of work indicates that estimates obtained from wavelets were more accurate than those obtained from histograms with the same amount of memory. Chakrabarti et al. [17] propose the use of wavelets for general-purpose approximate query processing and demonstrate how to compute joins, aggregations, and selections entirely in the wavelet-coefficient domain.

To extend this body of work to data streams, it becomes important to devise techniques for computing wavelets in the streaming model. In a related development, Matias, Vitter, and Wang [64] show how to dynamically maintain the top wavelet coefficients efficiently as the underlying data distribution is updated. There has been recent work on computing the top wavelet coefficients in the data stream model. The technique of Gilbert et al. [39], to approximate the best dyadic interval that most reduces the error, gives rise to an easy greedy algorithm to find the best $k$-term Haar wavelet representation. This is because the Haar wavelet basis consists of dyadic intervals with constant values. This improves upon a previous result by Gilbert et al. [40]. If the data is presented in sorted order, there is a simple algorithm that maintains the best $k$-term Haar wavelet representation using $O(k + \log n)$ space in a deterministic manner [40].

While there has been a lot of work on summary structures, it remains an interesting open problem to address the issue of global space allocation among different synopses vying for the same space. It requires that we come up with a global error metric for the synopses, which we minimize given the (main memory) space constraint. Moreover, the allocation will have to be dynamic, as the underlying data distribution and query workload change over time.

6.5 Sliding Windows

As discussed in Section 4, sliding windows prevent stale data from influencing analysis and statistics, and also serve as a tool for approximation in the face of bounded memory. There has been very little work on extending summarization techniques to sliding windows, and it remains a ripe research area. We briefly describe some of the recent work.

Datar et al. [26] showed how to maintain simple statistics over sliding windows, including the sketches used for computing the $L_1$ or $L_2$ norm. Their technique requires a multiplicative space overhead of $O(\frac{1}{\epsilon} \log n)$, where $n$ is the length of the sliding window and $\epsilon$ is the accuracy parameter. This enables the adaptation of the sketching-based algorithms to the sliding windows model. They also provide space lower bounds for various problems in the sliding windows model. In another paper, Babcock, Datar and Motwani [9] adapt the reservoir sampling algorithm to the sliding windows case. In their paper on computing iceberg queries over data streams, Manku and Motwani [60] also present techniques to adapt their algorithms to the sliding window model. Guha and Koudas [42] have adapted their earlier paper [43] to provide a technique for maintaining V-Optimal Histograms over sorted data streams in the sliding window model; however, they require the buffering of all the elements in the sliding window. The space requirement is linear in the size of the sliding window ($n$), although update time per data element is amortized to $O((k^2/\epsilon) \log^2 n)$, where $k$ is the number of buckets in the histogram and $\epsilon$ is the accuracy parameter.

Some open problems for sliding windows are: clustering, maintaining top wavelet coefficients, maintaining statistics like variance, and computing correlated aggregates [37].

6.6 Negative Results

There is an emerging set of negative results on the space-time requirements of algorithms that operate in the data stream model. Henzinger, Raghavan, and Rajagopalan [49] provided space lower bounds for concrete problems in the data stream model. These lower bounds are derived from results in communication complexity [56]. To understand the connection, observe that the memory used by any one-pass algorithm for a


function $f$, after seeing a prefix of the data stream, is lower bounded by the one-way communication required by two parties trying to compute $f$, where the first party has access to the same prefix and the second party has access to the suffix of the stream that is yet to arrive. Henzinger et al. use this approach to provide lower bounds for problems such as frequent item counting, approximate median, and some graph problems.

Again based on communication complexity, Alon, Matias and Szegedy [5] provide almost tight lower bounds for computing the frequency moments. In particular, they proved a lower bound of $\Omega(d)$ for estimating $F_\infty^*$, the count of the most frequent item, where $d$ is the domain size. At first glance this lower bound and a similar lower bound in Henzinger et al. [49] may seem to contradict the frequent item-set counting results of Manku and Motwani [60]. But note that the latter paper estimates the count of the most frequent item only if it exceeds $\epsilon n$. Such skewed distributions are common in practice, while the lower bounds are proven for pathological distributions where items have near-uniform frequency. This serves as a reminder that while it may be possible to prove strong space lower bounds for stream computations, considerations from applications sometimes enable us to circumvent the negative results.

Saks and Sun [73] provide space lower bounds for distance approximation between two vectors under the $L_p$ norm, for $p > 2$, in the data stream model. Munro and Paterson [66] showed that any algorithm that computes quantiles exactly in $p$ passes requires $\Omega(n^{1/p})$ space. Space lower bounds for maintaining simple statistics like count, sum, min/max, and the number of distinct values under the sliding windows model can be found in the work of Datar et al. [26].

A general lower-bound technique for sampling-based algorithms was presented by Bar-Yossef et al. [11]. It is useful for deriving space lower bounds for data stream algorithms that resort to oblivious sampling. It remains an interesting open problem to obtain similar general lower-bound techniques for other classes of algorithms in the data stream model. We feel that techniques based on communication complexity results [56] will prove useful in this context.

6.7 Miscellaneous

In this section, we give a potpourri of algorithmic results for data streams.

Data Mining

Decision trees are another form of synopsis used for prediction. Domingos et al. [28, 29] have studied the problem of maintaining decision trees over data streams. Clustering is yet another way to summarize data. Consider the $k$-median formulation for clustering: Given $n$ data points in a metric space, the objective is to choose $k$ representative points, such that the sum of the errors over the $n$ data points is minimized. The "error" for each data point is the distance from that point to the nearest of the $k$ chosen representative points. Guha et al. [44] presented a single-pass algorithm for maintaining approximate $k$-medians (cluster centers) that uses $O(n^\epsilon)$ space, for some $\epsilon < 1$, and $O(\mathrm{poly}(\log n))$ amortized time per data element, to compute a constant-factor approximation to the $k$-median problem. Their algorithm uses a divide-and-conquer approach which works as follows: Clustering proceeds hierarchically, where a small number ($O(n^\epsilon)$) of the original data points are clustered into $k$ centers. These $k$ centers are weighted by the number of points that are closest to them in the local solution. When we get $n^\epsilon$ weighted cluster centers by clustering different sets, we cluster them into higher-level cluster centers, and so on.
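The sketch below illustrates the divide-and-conquer structure on one-dimensional points; the local clustering step here is a crude stand-in (random restarts) for the constant-factor $k$-median subroutine actually used by Guha et al. [44], and `block` is assumed to be much larger than `k`:

    import random

    def local_kmedian(points, weights, k, tries=20, rng=random):
        """Stand-in local clustering: best of several random center sets
        under weighted 1-D k-median cost (illustrative, not constant-factor)."""
        best, best_cost = points[:k], float("inf")
        for _ in range(tries):
            centers = rng.sample(points, min(k, len(points)))
            cost = sum(w * min(abs(p - c) for c in centers)
                       for p, w in zip(points, weights))
            if cost < best_cost:
                best, best_cost = centers, cost
        return best

    def fold_weights(points, weights, centers):
        """Weight each center by the total weight of points nearest to it."""
        w = [0] * len(centers)
        for p, pw in zip(points, weights):
            w[min(range(len(centers)), key=lambda i: abs(p - centers[i]))] += pw
        return w

    def stream_kmedian(stream, k, block=100):
        """Cluster each block into k weighted centers; re-cluster the
        centers themselves when too many accumulate."""
        centers, weights, buf = [], [], []
        for p in stream:
            buf.append(p)
            if len(buf) == block:
                local = local_kmedian(buf, [1] * len(buf), k)
                weights += fold_weights(buf, [1] * len(buf), local)
                centers += local
                buf = []
                if len(centers) > block:    # second level of the hierarchy
                    local = local_kmedian(centers, weights, k)
                    weights = fold_weights(centers, weights, local)
                    centers = local
        return local_kmedian(centers + buf, weights + [1] * len(buf), k)

    pts = ([random.gauss(0, 1) for _ in range(300)]
           + [random.gauss(10, 1) for _ in range(300)])
    print(stream_kmedian(pts, k=2))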

Multiple Streams

Gibbons and Tirthapura [38] considered the problem of computing simple functions, such as the number of distinct elements, over unions of data streams. This is useful for applications that work in a distributed environment, where it is not feasible to send all the data to a central site for processing. It is important to


note that some of the techniques presented earlier, especially those that are based on sketching, are amenable to distributed computation over multiple streams.

Reductions of Streams

In a recent paper, Bar-Yossef, Kumar, and Sivakumar [12] introduce the notion of reductions in streaming algorithms. In order for the reductions to be efficient, one needs to employ list-efficient streaming algorithms. The idea behind list-efficient streaming algorithms is that instead of being presented one data item at a time, they are implicitly presented with a list of data items in a succinct form. If the algorithm can efficiently process the list in time that is a function of the succinct representation size, then it is said to be list-efficient. They develop some list-efficient algorithms and, using the reduction paradigm, address several interesting problems like computing frequency moments [5] (which includes the special case of counting the number of distinct elements) and counting the number of triangles in a graph presented as a stream. They also prove a space lower bound for the latter problem. To the best of our knowledge, besides this work and that of Henzinger et al. [49], there has been little work on graph problems in the streaming model. Such algorithms will likely be very useful for analyzing large graphical structures such as the web graph.

Property Testing

Feigenbaum et al. [34] introduced the concepts of streaming property testers and streaming spot checkers. These are programs that make one pass over the data and, using small space, verify whether the data satisfies a certain property. They show that there are properties that are efficiently testable by a streaming-tester but not by a sampling-tester, and other problems for which the converse is true. They also present an efficient sampling-tester for testing the "groupedness" property of a sequence that uses $O(\sqrt{n})$ samples, $O(\sqrt{n})$ space, and $O(\sqrt{n} \log n)$ time. A sequence $a_1, \ldots, a_n$ is said to be grouped if $a_i = a_k$ and $i < j < k$ imply $a_i = a_j = a_k$; i.e., for each type $t$, all occurrences of $t$ are in a single contiguous run. Thus, groupedness is a natural relaxation of the sortedness property and is a natural property that one may desire in a massive streaming data set. The work discussed here illustrates that some properties are efficiently testable by sampling algorithms but not streaming algorithms.
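For intuition, an exact one-pass check of groupedness is straightforward if one is willing to store every distinct value seen, which is precisely the space cost that the testers of [34] avoid by allowing approximate answers:

    _SENT = object()

    def is_grouped(stream):
        """Exact one-pass groupedness check using O(distinct values) space."""
        closed, prev = set(), _SENT   # values whose contiguous run has ended
        for item in stream:
            if item != prev:
                if item in closed:    # value re-appears after its run closed
                    return False
                if prev is not _SENT:
                    closed.add(prev)  # close the previous run
                prev = item
        return True

    print(is_grouped([1, 1, 2, 2, 3]))   # True
    print(is_grouped([1, 2, 1]))         # False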

Measuring Sortedness

Measuring the "sortedness" of a data stream could be useful in some applications; for example, it is useful in determining the choice of a sort algorithm for the underlying data. Ajtai et al. [3] have studied the problem of estimating the number of inversions (a measure of sortedness) in a permutation to within a factor $\epsilon$, where the permutation is presented in the data stream model. They obtained an algorithm using $O(\log n \log \log n)$ space and $O(\log n)$ time per data element. They also prove an $\Omega(n)$ space lower bound for randomized exact computation, thus showing that approximation is essential.
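For reference, the quantity being approximated can be computed exactly offline in $O(n \log n)$ time by a merge-sort count, as in the sketch below; the point of Ajtai et al. [3] is to approximate it in polylogarithmic space:

    def count_inversions(seq):
        """Exact inversion count via merge sort (offline, O(n) space)."""
        def sort(a):
            if len(a) <= 1:
                return a, 0
            mid = len(a) // 2
            left, li = sort(a[:mid])
            right, ri = sort(a[mid:])
            merged, inv, i, j = [], li + ri, 0, 0
            while i < len(left) and j < len(right):
                if left[i] <= right[j]:
                    merged.append(left[i]); i += 1
                else:                        # right[j] jumps over left[i:]
                    inv += len(left) - i
                    merged.append(right[j]); j += 1
            merged.extend(left[i:]); merged.extend(right[j:])
            return merged, inv
        return sort(list(seq))[1]

    print(count_inversions([3, 1, 2]))  # 2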

7 Conclusion and Future Work

We have isolated a number of issues that arise when considering data management, query processing, and algorithmic problems in the new setting of continuous data streams. We proposed some initial solutions, described past and current work related to data streams, and suggested a general architecture for a Data Stream Management System (DSMS). At this point let us take a step back and consider some "meta-questions" with regard to the motivations and need for a new approach.

- Is there more to effective data stream systems than conventional database technology with enhanced support for streaming primitives such as triggers, temporal constructs, and data rate management?


- Is there a need for database researchers to develop fundamental and general-purpose models, algorithms, and systems for data streams? Perhaps it suffices to build ad hoc solutions for each specific application (network management, web monitoring, security, finance, sensors, etc.).

- Are there any "killer apps" for data stream systems?

We believe that all three questions can be answered in the affirmative, although of course only time will tell.

Assuming positive answers to the "meta-questions" above, we see several fundamental aspects to the design of data stream systems, some of which we discussed in detail in this paper. One important general question is the interface provided by the DSMS. Our approach at Stanford is to extend SQL to support stream-oriented primitives, providing a purely declarative interface as in traditional database systems, although we also allow direct submission of query plans. In contrast, the Aurora project [16] provides a procedural "boxes and arrows" approach as the primary interface for the application developer.

Other fundamental issues discussed in the paper include timestamping and ordering, support for sliding window queries, and dealing effectively with blocking operators. A major open question, about which we had very little to say, is that of dealing with distributed streams. It does not make sense to redirect high-rate streams to a central location for query processing, so it becomes imperative to push some processing to the points of arrival of the distributed streams, raising a host of issues at every level of a DSMS [58]. Another issue we touched on only briefly in Section 4.5 is that of constraints over streams, and how they can be exploited in query processing. Finally, many systems questions remain open in query optimization, construction of synopses, resource management, approximate query processing, and the development of an appropriate and well-accepted benchmark for data stream systems.

From a purely theoretical perspective, perhaps the most interesting open question is that of defining extensions of relational operators to handle data stream constructs, and to study the resulting "stream algebra" and other properties of these extensions. Such a foundation is surely key to developing a general-purpose, well-understood query processor for data streams.

Acknowledgements

We thank all the members of the Stanford STREAM research group for their contributions and feedback.

References

[1] S. Acharya, P. B. Gibbons, and V. Poosala. Congressional samples for approximate answering of group-by queries. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 487–498, May 2000.

[2] S. Acharya, P. B. Gibbons, V. Poosala, and S. Ramaswamy. Join synopses for approximate query answering. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 275–286, June 1999.

[3] M. Ajtai, T. Jayram, R. Kumar, and D. Sivakumar. Counting inversions in a data stream. Manuscript, 2001.

[4] N. Alon, P. Gibbons, Y. Matias, and M. Szegedy. Tracking join and self-join sizes in limited storage. In Proc. of the 1999 ACM Symp. on Principles of Database Systems, pages 10–20, 1999.

[5] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In Proc. of the 1996 Annual ACM Symp. on Theory of Computing, pages 20–29, 1996.


[6] M. Altinel and M. J. Franklin. Efficient filtering of XML documents for selective dissemination of information. In Proc. of the 2000 Intl. Conf. on Very Large Data Bases, pages 53–64, Sept. 2000.

[7] A. Arasu, B. Babcock, S. Babu, J. McAlister, and J. Widom. Characterizing memory requirements for queries over continuous data streams. In Proc. of the 2002 ACM Symp. on Principles of Database Systems, June 2002. Available at http://dbpubs.stanford.edu/pub/2001-49.

[8] R. Avnur and J. Hellerstein. Eddies: Continuously adaptive query processing. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 261–272, May 2000.

[9] B. Babcock, M. Datar, and R. Motwani. Sampling from a moving window over streaming data. In Proc. of the 2002 Annual ACM-SIAM Symp. on Discrete Algorithms, pages 633–634, 2002.

[10] S. Babu and J. Widom. Continuous queries over data streams. SIGMOD Record, 30(3):109–120, Sept. 2001.

[11] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Sampling algorithms: Lower bounds and applications. In Proc. of the 2001 Annual ACM Symp. on Theory of Computing, pages 266–275, 2001.

[12] Z. Bar-Yossef, R. Kumar, and D. Sivakumar. Reductions in streaming algorithms, with an application to counting triangles in graphs. In Proc. of the 2002 Annual ACM-SIAM Symp. on Discrete Algorithms, pages 623–632, 2002.

[13] D. Barbara et al. The New Jersey data reduction report. IEEE Data Engineering Bulletin, 20(4):3–45, 1997.

[14] S. Bellamkonda, T. Bozkaya, B. Ghosh, A. Gupta, J. Haydu, S. Subramanian, and A. Witkowski. Analytic functions in Oracle 8i. Available at http://www-db.stanford.edu/dbseminar/Archive/SpringY2000/speakers/agupta/paper.pdf.

[15] J. A. Blakeley, N. Coburn, and P. A. Larson. Updating derived relations: Detecting irrelevant and autonomously computable updates. ACM Trans. on Database Systems, 14(3):369–400, 1989.

[16] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams – a new class of DBMS applications. Technical Report CS-02-01, Department of Computer Science, Brown University, Feb. 2002.

[17] K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. Approximate query processing using wavelets. In Proc. of the 2000 Intl. Conf. on Very Large Data Bases, pages 111–122, Sept. 2000.

[18] M. Charikar, S. Chaudhuri, R. Motwani, and V. Narasayya. Towards estimation error guarantees for distinct values. In Proc. of the 2000 ACM Symp. on Principles of Database Systems, pages 268–279, 2000.

[19] S. Chaudhuri, G. Das, and V. Narasayya. A robust, optimization-based approach for approximate answering of aggregate queries. In Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, pages 295–306, May 2001.

[20] S. Chaudhuri and R. Motwani. On sampling and relational operators. Bulletin of the Technical Committee on Data Engineering, 22:35–40, 1999.

[21] S. Chaudhuri, R. Motwani, and V. Narasayya. Random sampling for histogram construction: How much is enough? In Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, pages 436–447, 1998.


[22] S. Chaudhuri, R. Motwani, and V. Narasayya. On random sampling over joins. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 263–274, June 1999.

[23] S. Chaudhuri and V. Narasayya. An efficient cost-driven index selection tool for Microsoft SQL Server. In Proc. of the 1997 Intl. Conf. on Very Large Data Bases, pages 146–155, 1997.

[24] J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. NiagaraCQ: A scalable continuous query system for internet databases. In Proc. of the 2000 ACM SIGMOD Intl. Conf. on Management of Data, pages 379–390, May 2000.

[25] C. Cortes, K. Fisher, D. Pregibon, and A. Rogers. Hancock: A language for extracting signatures from data streams. In Proc. of the 2000 ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 9–17, Aug. 2000.

[26] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. of the 2002 Annual ACM-SIAM Symp. on Discrete Algorithms, pages 635–644, 2002.

[27] A. Dobra, J. Gehrke, M. Garofalakis, and R. Rastogi. Processing complex aggregate queries over data streams. In Proc. of the 2002 ACM SIGMOD Intl. Conf. on Management of Data, 2002.

[28] P. Domingos and G. Hulten. Mining high-speed data streams. In Proc. of the 2000 ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 71–80, Aug. 2000.

[29] P. Domingos, G. Hulten, and L. Spencer. Mining time-changing data streams. In Proc. of the 2001 ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining, pages 97–106, 2001.

[30] N. Duffield and M. Grossglauser. Trajectory sampling for direct traffic observation. In Proc. of the 2000 ACM SIGCOMM, pages 271–284, Sept. 2000.

[31] C. Faloutsos, M. Ranganathan, and Y. Manolopoulos. Fast subsequence matching in time-series databases. In Proc. of the 1994 ACM SIGMOD Intl. Conf. on Management of Data, pages 419–429, May 1994.

[32] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. In Proc. of the 1998 Intl. Conf. on Very Large Data Bases, pages 299–310, 1998.

[33] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. An approximate L1-difference algorithm for massive data streams. In Proc. of the 1999 Annual IEEE Symp. on Foundations of Computer Science, pages 501–511, 1999.

[34] J. Feigenbaum, S. Kannan, M. Strauss, and M. Viswanathan. Testing and spot checking of data streams. In Proc. of the 2000 Annual ACM-SIAM Symp. on Discrete Algorithms, pages 165–174, 2000.

[35] P. Flajolet and G. Martin. Probabilistic counting. In Proc. of the 1983 Annual IEEE Symp. on Foundations of Computer Science, 1983.

[36] H. Garcia-Molina, W. Labio, and J. Yang. Expiring data in a warehouse. In Proc. of the 1998 Intl. Conf. on Very Large Data Bases, pages 500–511, Aug. 1998.

[37] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, pages 13–24, May 2001.

[38] P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. of the 2001 ACM Symp. on Parallel Algorithms and Architectures, pages 281–291, 2001.


[39] A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Fast, small-space algorithms for approximate histogram maintenance. In Proc. of the 2002 Annual ACM Symp. on Theory of Computing, 2002.

[40] A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proc. of the 2001 Intl. Conf. on Very Large Data Bases, pages 79–88, 2001.

[41] M. Greenwald and S. Khanna. Space-efficient online computation of quantile summaries. In Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, pages 58–66, 2001.

[42] S. Guha and N. Koudas. Approximating a data stream for querying and estimation: Algorithms and performance evaluation. In Proc. of the 2002 Intl. Conf. on Data Engineering, 2002.

[43] S. Guha, N. Koudas, and K. Shim. Data-streams and histograms. In Proc. of the 2001 Annual ACM Symp. on Theory of Computing, pages 471–475, 2001.

[44] S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In Proc. of the 2000 Annual IEEE Symp. on Foundations of Computer Science, pages 359–366, Nov. 2000.

[45] A. Gupta, H. V. Jagadish, and I. S. Mumick. Data integration using self-maintainable views. In Proc. of the 1996 Intl. Conf. on Extending Database Technology, pages 140–144, Mar. 1996.

[46] P. Haas, J. Naughton, P. Seshadri, and L. Stokes. Sampling-based estimation of the number of distinct values of an attribute. In Proc. of the 1995 Intl. Conf. on Very Large Data Bases, pages 311–322, Sept. 1995.

[47] J. Hellerstein, M. Franklin, et al. Adaptive query processing: Technology in evolution. IEEE Data Engineering Bulletin, 23(2):7–18, June 2000.

[48] J. Hellerstein, P. Haas, and H. Wang. Online aggregation. In Proc. of the 1997 ACM SIGMOD Intl. Conf. on Management of Data, pages 171–182, May 1997.

[49] M. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report TR 1998-011, Compaq Systems Research Center, Palo Alto, California, May 1998.

[50] P. Indyk. Stable distributions, pseudorandom generators, embeddings and data stream computation. In Proc. of the 2000 Annual IEEE Symp. on Foundations of Computer Science, pages 189–197, 2000.

[51] Y. E. Ioannidis and V. Poosala. Histogram-based approximation of set-valued query-answers. In Proc. of the 1999 Intl. Conf. on Very Large Data Bases, pages 174–185, Sept. 1999.

[52] iPolicy Networks home page. http://www.ipolicynetworks.com.

[53] Z. Ives, D. Florescu, M. Friedman, A. Levy, and D. Weld. An adaptive query execution system for data integration. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 299–310, June 1999.

[54] H. Jagadish, N. Koudas, S. Muthukrishnan, V. Poosala, K. Sevcik, and T. Suel. Optimal histograms with quality guarantees. In Proc. of the 1998 Intl. Conf. on Very Large Data Bases, pages 275–286, 1998.

[55] H. Jagadish, I. Mumick, and A. Silberschatz. View maintenance issues for the Chronicle data model. In Proc. of the 1995 ACM Symp. on Principles of Database Systems, pages 113–124, May 1995.


[56] E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, 1997.

[57] L. Liu, C. Pu, and W. Tang. Continual queries for internet scale event-driven information delivery. IEEE Trans. on Knowledge and Data Engineering, 11(4):583–590, Aug. 1999.

[58] S. Madden and M. J. Franklin. Fjording the stream: An architecture for queries over streaming sensor data. In Proc. of the 2002 Intl. Conf. on Data Engineering, Feb. 2002. (To appear).

[59] S. Madden, J. Hellerstein, M. Shah, and V. Raman. Continuously adaptive continuous queries over streams. In Proc. of the 2002 ACM SIGMOD Intl. Conf. on Management of Data, June 2002. (To appear).

[60] G. Manku and R. Motwani. Approximate frequency counts over streaming data. Manuscript, 2002.

[61] G. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, pages 426–435, June 1998.

[62] G. Manku, S. Rajagopalan, and B. G. Lindsay. Random sampling techniques for space efficient online computation of order statistics of large datasets. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 251–262, June 1999.

[63] Y. Matias, J. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proc. of the 1998 ACM SIGMOD Intl. Conf. on Management of Data, pages 448–459, June 1998.

[64] Y. Matias, J. Vitter, and M. Wang. Dynamic maintenance of wavelet-based histograms. In Proc. of the 2000 Intl. Conf. on Very Large Data Bases, pages 101–110, Sept. 2000.

[65] R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, 1995.

[66] J. Munro and M. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12:315–323, 1980.

[67] B. Nguyen, S. Abiteboul, G. Cobena, and M. Preda. Monitoring XML data on the web. In Proc. of the 2001 ACM SIGMOD Intl. Conf. on Management of Data, pages 437–448, May 2001.

[68] D. S. Parker, R. R. Muntz, and H. L. Chau. The Tangram stream query processing system. In Proc. of the 1989 Intl. Conf. on Data Engineering, pages 556–563, Feb. 1989.

[69] D. S. Parker, E. Simon, and P. Valduriez. SVP: A model capturing sets, lists, streams, and parallelism. In Proc. of the 1992 Intl. Conf. on Very Large Data Bases, pages 115–126, Aug. 1992.

[70] V. Poosala and V. Ganti. Fast approximate answers to aggregate queries on a data cube. In Proc. of the 1999 Intl. Conf. on Scientific and Statistical Database Management, pages 24–33, July 1999.

[71] D. Quass, A. Gupta, I. Mumick, and J. Widom. Making views self-maintainable for data warehousing. In Proc. of the 1996 Intl. Conf. on Parallel and Distributed Information Systems, pages 158–169, Dec. 1996.

[72] V. Raman, B. Raman, and J. Hellerstein. Online dynamic reordering for interactive data processing. In Proc. of the 1999 Intl. Conf. on Very Large Data Bases, 1999.

[73] M. Saks and X. Sun. Space lower bounds for distance approximation in the data stream model. In Proc. of the 2002 Annual ACM Symp. on Theory of Computing, 2002.


[74] U. Schreier, H. Pirahesh, R. Agrawal, and C. Mohan. Alert: An architecture for transforming a passive DBMS into an active DBMS. In Proc. of the 1991 Intl. Conf. on Very Large Data Bases, pages 469–478, Sept. 1991.

[75] T. K. Sellis. Multiple-query optimization. ACM Trans. on Database Systems, 13(1):23–52, 1988.

[76] P. Seshadri, M. Livny, and R. Ramakrishnan. Sequence query processing. In Proc. of the 1994 ACM SIGMOD Intl. Conf. on Management of Data, pages 430–441, May 1994.

[77] P. Seshadri, M. Livny, and R. Ramakrishnan. SEQ: A model for sequence databases. In Proc. of the 1995 Intl. Conf. on Data Engineering, pages 232–239, Mar. 1995.

[78] P. Seshadri, M. Livny, and R. Ramakrishnan. The design and implementation of a sequence database system. In Proc. of the 1996 Intl. Conf. on Very Large Data Bases, pages 99–110, Sept. 1996.

[79] J. Shanmugasundaram, K. Tufte, D. J. DeWitt, J. F. Naughton, and D. Maier. Architecting a network query engine for producing partial results. In Proc. of the 2000 Intl. Workshop on the Web and Databases, pages 17–22, May 2000.

[80] R. Snodgrass and I. Ahn. A taxonomy of time in databases. In Proc. of the 1985 ACM SIGMOD Intl. Conf. on Management of Data, pages 236–245, 1985.

[81] SQL Standard. On-line analytical processing (SQL/OLAP). Available from http://www.ansi.org/, document #ISO/IEC 9075-2/Amd1:2001.

[82] Stanford Stream Data Management (STREAM) Project. http://www-db.stanford.edu/stream.

[83] M. Sullivan. Tribeca: A stream database manager for network traffic analysis. In Proc. of the 1996 Intl. Conf. on Very Large Data Bases, page 594, Sept. 1996.

[84] D. Terry, D. Goldberg, D. Nichols, and B. Oki. Continuous queries over append-only databases. In Proc. of the 1992 ACM SIGMOD Intl. Conf. on Management of Data, pages 321–330, June 1992.

[85] Traderbot home page. http://www.traderbot.com.

[86] P. Tucker, D. Maier, T. Sheard, and L. Fegaras. Enhancing relational operators for querying over punctuated data streams. Manuscript, 2002. Available at http://www.cse.ogi.edu/dot/niagara/pstream/punctuating.pdf.

[87] J. Ullman and J. Widom. A First Course in Database Systems. Prentice Hall, Upper Saddle River, New Jersey, 1997.

[88] T. Urhan and M. Franklin. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin, 23(2):27–33, June 2000.

[89] S. Viglas and J. Naughton. Rate-based query optimization for streaming information sources. In Proc. of the 2002 ACM SIGMOD Intl. Conf. on Management of Data, June 2002. (To appear).

[90] J. Vitter. Random sampling with a reservoir. ACM Trans. on Mathematical Software, 11(1):37–57, 1985.

[91] J. Vitter. External memory algorithms and data structures. In J. Abello, editor, External Memory Algorithms, pages 1–18. DIMACS, 1999.


[92] J. Vitter and M. Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In Proc. of the 1999 ACM SIGMOD Intl. Conf. on Management of Data, pages 193–204, June 1999.

[93] J. Vitter, M. Wang, and B. Iyer. Data cube approximation and histograms via wavelets. In Proc. of the 1998 Intl. Conf. on Information and Knowledge Management, Nov. 1998.

[94] XML Path Language (XPath) Version 1.0, Nov. 1999. W3C Recommendation available at http://www.w3.org/TR/xpath.

[95] Yahoo home page. http://www.yahoo.com.
