
Adaptive, Hands-Off Stream Mining

Spiros Papadimitriou∗ Anthony Brockwell† Christos Faloutsos∗

Computer Science Department∗ and Department of Statistics†

Carnegie Mellon University, Pittsburgh, PA, USA

{spapadim,christos}@cs.cmu.edu, [email protected]

Abstract

Sensor devices and embedded processors are becoming ubiquitous. Their limited resources (CPU, memory and/or communication bandwidth and power) pose some interesting challenges. We need both powerful and concise “languages” to represent the important features of the data, which can (a) adapt and handle arbitrary periodic components, including bursts, and (b) require little memory and a single pass over the data.

We propose AWSOM (Arbitrary Window Stream mOdeling Method), which allows sensors in remote or hostile environments to efficiently and effectively discover interesting patterns and trends. This can be done automatically, i.e., with no user intervention and expert tuning before or during data gathering. Our algorithms require limited resources and can thus be incorporated in sensors, possibly alongside a distributed query processing

†This material is based upon work supported by the National Science Foundation under Grants No. DMS-9819950 and IIS-0083148.

∗This material is based upon work supported by the National Science Foundation under Grants No. IIS-9817496, IIS-9988876, IIS-0083148, IIS-0113089, IIS-0209107, IIS-0205224, by the Pennsylvania Infrastructure Technology Alliance (PITA) Grant No. 22-901-0001, and by the Defense Advanced Research Projects Agency under Contract No. N66001-00-1-8936. Additional funding was provided by donations from Intel. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation, DARPA, or other funding parties.

Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the VLDB copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Very Large Data Base Endowment. To copy otherwise, or to republish, requires a fee and/or special permission from the Endowment.

Proceedings of the 29th VLDB Conference, Berlin, Germany, 2003

engine [9, 5, 22]. Updates are performed in constant time, using logarithmic space. Existing, state-of-the-art forecasting methods (SARIMA, GARCH, etc.) fall short on one or more of these requirements. To the best of our knowledge, AWSOM is the first method that has all the above characteristics.

Experiments on real and synthetic datasets demonstrate that AWSOM discovers meaningful patterns over long time periods. Thus, the patterns can also be used to make long-range forecasts, which are notoriously difficult to perform. In fact, AWSOM outperforms manually set up auto-regressive models, both in terms of long-term pattern detection and modeling, as well as by at least 10× in resource consumption.

1 Introduction

Several applications produce huge amounts of data in the form of a semi-infinite stream of values [17, 15, 12, 14], typically samples or measurements. Formally, a stream is a discrete sequence of numbers X0, X1, . . . , Xt, . . .. Time sequences have attracted attention [6], for forecasting in financial, sales, environmental, ecological and biological time series, to mention a few. However, several new and exciting applications have recently become possible.

The emergence of cheap and small sensors has attracted significant attention. Sensors are small devices that gather measurements—for example, temperature readings, road traffic data, geological and astronomical observations, patient physiological data, etc. There are numerous, fascinating applications for such sensors and sensor networks, in fields such as health care and monitoring, industrial process control, civil infrastructure [8], road traffic safety and smart houses, to mention a few. Although current small sensor prototypes [19] have limited resources (512 bytes to 128Kb of storage), dime-sized devices with memory and processing power equivalent to a PDA are not far away.


Figure 1: Automobile traffic, complete series, one day and one hour (the first two are 8- and 4-point averages to make the trends easier to see). There is clearly a daily periodicity. Also, in each day there is another distinct pattern (morning and afternoon rush hours). However, at an hour scale traffic is highly bursty—in fact, it can be modeled by self-similar noise. We want a method that can capture all this information automatically, with one pass and using limited memory!

In fact, PDA-like devices with data gathering units are already being employed in some of the above applications. The goal in the next decade is single-chip computers with powerful processors and 2–10Gb [8] of nonvolatile storage. Furthermore, embedded processors are becoming ubiquitous and their power has yet to be harnessed. A few examples of such applications are (a) intelligent (active) disks [24] that learn input traffic patterns and do appropriate prefetching and buffering, and (b) intelligent routers that monitor data traffic and simplify network management.

From now on, we use the term “sensor” broadly, to refer to any embedded computing device with fairly limited processing, memory and (optionally) communication resources, which generates a semi-infinite sequence of measurements.

The resource limitations unavoidably imply the need for certain trade-offs—it is impossible to store everything. Furthermore, we want to make the most of available resources, allowing the sensor to adapt and operate without supervision for as long as possible. This is the problem we address in this work. The goal is a “language” (i.e., model/representation) for efficient and effective, automatic stream mining. We want to collect information, in real-time and without any human intervention, and discover patterns such as “the hourly phone call volume so far follows a daily and a weekly periodicity, with bursts roughly every year” (which a human might recognize as, e.g., the Mother’s day surge).

This problem is orthogonal to that of continuous query processing. We focus on an adaptive algorithm that can look for arbitrary patterns and requires no initial human intervention to guide it. There are situations when we do not know beforehand what we are looking for. Furthermore, it may be impossible to guide the sensor as it collects data, due to the large volume of data and/or limited or unavailable communication. If further exploration is desired, users can issue further queries, guided by the general long-term patterns to quickly narrow down the “search space.”

In detail, the main requirements are (see also Figure 1): (1) No human in the loop: In a general sensor setting we cannot afford human intervention. (2) Periodic component identification: Humans can achieve this task, visually, from the time-plot. Our method should automatically spot multiple periodic components, each of unknown, arbitrary period. (3) Online, one-pass algorithm: We can afford neither the memory nor the time for offline updates, much less multiple passes over the data stream. (4) Limited memory: Sensor memory will be exhausted unless our method carefully detects redundancies (or, equivalently, patterns) and exploits them. Furthermore, we want to collect data even when network connectivity is intermittent (e.g., due to power constraints) or even non-existent. (5) Simple, but powerful patterns: We need simple patterns (i.e., equations, rules) which can be easily communicated to other nearby sensors and/or to a central processing site. These patterns should be powerful enough to capture most of the regularities in real-world signals. (6) Any-time forecasting/outlier detection: It is not enough to do compression (e.g., of long silence periods, or by ignoring small Fourier or wavelet coefficients). The model should be generative and thus able to report outliers. An outlier can be defined as any value that deviates too much from our forecast (e.g., by two standard deviations). AWSOM has all of these characteristics, while none of the previously published methods (AR and variations, Fourier analysis, wavelet decomposition—see Section 2.2) can claim the same.

2 Related work

Previous work broadly falls into two categories. The first includes work done by the database community on continuous query processing. These methods employ increasingly sophisticated mathematical tools, but the focus is typically on some form of compression (or synopses), which does not involve generative models. The other includes various statistical methods for time series forecasting. However, these require the intervention of trained statisticians and are computationally intensive.

Thus the authors believe that there is a need for straightforward methods of time series model building which can be applied in real-time to semi-infinite streams of data, using limited memory.

2.1 Continuous queries and stream processing

An interesting method for discovering representative trends in time series using sketches was proposed by Indyk et al. [20]. A representative trend is a section of the time series that has the smallest sum of “distances” from all other sections of the same length. The proposed method employs random projections for dimensionality reduction and the FFT to quickly compute the sum of distances. However, it cannot be applied to semi-infinite streams, since each section has to be compared to every other.

Gilbert et al. [17] use wavelets to “compress” the data into a fixed amount of memory, by keeping track of the largest Haar wavelet coefficients and carefully updating them on-line (in the following, we will use the name Incremental DWT, or IncDWT for short). However, this method does not try to discover patterns and trends in the data. Thus, it cannot compete directly with our method, which employs a generative model. More recently, Garofalakis et al. [14] presented an approach for accurate data compression using probabilistic wavelet synopses. However, this method has an entirely different focus and cannot be applied to semi-infinite streams. The recent work of Guha and Koudas [18] efficiently constructs ε-accurate histograms in a stream setting, but these synopsize data over a fixed time window. Further work on streams focuses on providing exact answers to pre-specified sets of queries using the minimum amount of memory possible. Arasu et al. [2] study the memory requirements of continuous queries over relational data streams. Datar et al. [11] keep exact summary statistics and provide theoretical bounds in the setting of a bit stream.

There is also recent work on approximate answers to various types of continuous queries. Gehrke et al. [15] present a comprehensive approach for answering correlated aggregate queries (e.g., “find points below the (current) average”), using histogram “summaries” to approximate aggregates. Dobra et al. [12] present a method for approximate answers to aggregate multi-join queries over several streams, using random projections and boosting.

Zhang et al. [28] present a framework for spatio-temporal joins using multiple-granularity indices. Aggregation levels are pre-specified and the main focus is on efficient indexing. Also, Yi et al. [26] present a method for analysis of multiple co-evolving sequences. A system for linear pattern discovery on multi-dimensional time series was presented recently in [10]. Although this framework employs varying resolutions in time, it does so by straight aggregation, using manually selected aggregation levels (although the authors discuss the use of a geometric progression of time frames), and can only deal with, essentially, linear trends. Finally, very recently, Bulut et al. [7] proposed an approach for hierarchical stream summarization (similar to that of [17]) which focuses on simple queries and communication/caching issues for wavelet coefficients.

2.2 Time series methods

None of the continuous querying methods deal with pattern discovery and forecasting. The typical approaches to forecasting (i.e., generative time series models) include the traditional auto-regressive (AR) models and their generalizations: auto-regressive moving average (ARMA), auto-regressive integrated moving average (ARIMA) and seasonal ARIMA (SARIMA) [6]. Although popular, these methods fail to meet many of the requirements listed in the introduction. The most important failure is that they need human intervention and fine-tuning. As mentioned in statistics textbooks such as [6]:

“The first step in the analysis of any time series is to plot the data. [...] Inspection of a graph may also suggest the possibility of representing the data as a realization of [the ‘classical decomposition’ model].”

Thus, such methods are not suited for remote, unsupervised operation. Furthermore, these methods have a number of other limitations.

Existing model-fitting methods are typically batch-based (i.e., they do not allow online update of the parameters). Established methods for determining model structure are at best computationally intensive, besides not being easily automated.

If there are non-sinusoidal periodic components, ARIMA models will miss them completely. Large window sizes introduce severe estimation problems, both in terms of resource requirements as well as accuracy.

In addition, ARIMA models cannot handle bursty time series, even when the bursts recur. While GARCH (generalized auto-regressive conditional heteroskedasticity) models [4] can handle the class of bursty white noise sequences, the computational difficulties involved are prohibitive. Recently, the ARIMA model has been extended to ARFIMA (auto-regressive fractionally integrated moving average), which handles the class of self-similar bursty sequences [3]. However, ARFIMA is even harder to use than ARIMA.

All the above methods deal with linear forecasting. Non-linear modeling methods [25] also require human intervention to choose the appropriate windows for non-linear regression or to configure an artificial neural network.

2.3 Other

There is a large body of work in the signal processing literature related to compression and feature extraction. Typical tools include the Fast Fourier Transform (FFT), as well as the Discrete Wavelet Transform (DWT) [23]. However, most of the algorithms (a) deal with fixed-length signals of size N, and (b) cannot do forecasting (i.e., they do not employ a generative model).

3 Background material

In this section we give a very brief introduction to some necessary background material.

3.1 Auto-regressive modeling

Auto-regressive models are the most widely known and used. Due to space constraints, we present only the basic ideas—more information can be found in, e.g., [6].

Page 4: Adaptive, Hands-Off Stream Mining - bitquill.netbitquill.net/pdf/stream_vldb03.pdf · Adaptive, Hands-Off Stream Mining ... Pittsburgh, PA, USA {spapadim,christos}@cs.cmu.edu, abrock@stat.cmu.edu

Symbol                  Definition
Xt                      Value at time t = 0, 1, . . ..
N                       Number of points so far from {Xt}.
Wl,t                    Wavelet coefficient (level l, time t).
Vl,t                    Scaling coefficient (level l, time t).
βδl,δt                  AWSOM coefficient, (δl, δt) ∈ D.
D                       Set of window offsets for AWSOM.
AWSOM(n0, . . . , nλ)   Offsets per level (n0, . . . , nλ) in D—see
                        Definition 5; (n0, . . . , nλ) is also called
                        the AWSOM order.
λ                       Depth of AWSOM model, λ ≥ 0.
k                       Total order of an AWSOM model,
                        k ≡ |D| = n0 + n1 + · · · + nλ.

Table 1: Symbols and definitions.


Figure 2: Haar bases and correspondence to time/frequency (for signal length N = 16). Each wavelet coefficient is a linear projection of the signal to the respective basis.

The main idea is to express Xt as a function of its previous values, plus (filtered) noise εt:

    Xt = φ1 Xt−1 + · · · + φW Xt−W + εt    (1)

where W is a window that is determined by trial and error, or by using a criterion that penalizes model complexity (i.e., large values of W), like the Akaike Information Criterion (AIC). Seasonal variants (SAR, SAR(I)MA) also use window offsets that are multiples of a single, fixed period (i.e., besides terms of the form Xt−i, the equation contains terms of the form Xt−Si, where S is a constant). AR(I)MA requires preprocessing by trained statisticians to remove trends and seasonalities, typically by visual inspection of the sequence itself, as well as its Auto-Correlation Function (ACF).

Recursive least squares. Recursive Least Squares (RLS) is a method that allows dynamic update of a least-squares fit. More details are given in Appendix B and in [27].
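For concreteness, here is a minimal numpy sketch of fitting Equation (1) by batch least squares; the function names (fit_ar, forecast_one) are illustrative and not from the paper, and no trend removal or AIC-based window selection is attempted.

    import numpy as np

    def fit_ar(x, W):
        # Fit X_t = phi_1 X_{t-1} + ... + phi_W X_{t-W} + eps_t by
        # ordinary least squares (Equation 1; no season/trend removal).
        x = np.asarray(x, dtype=float)
        X = np.vstack([x[t - W:t][::-1] for t in range(W, len(x))])
        y = x[W:]
        phi, *_ = np.linalg.lstsq(X, y, rcond=None)
        return phi

    def forecast_one(x, phi):
        # One-step-ahead forecast from the last W observed values.
        W = len(phi)
        return float(phi @ np.asarray(x[-W:], dtype=float)[::-1])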

3.2 Wavelets

The N-point discrete wavelet transform (DWT) of a length-N time sequence gives N wavelet coefficients. Wavelets are best introduced with the Haar transform, because of its simplicity (a more rigorous introduction can be found, e.g., in [23]). At each level l of the construction we keep track of two sets of coefficients, each of which “looks” at a time window of size 2^l:

• Vl,t: The smooth component, which consists of the N/2^l scaling coefficients. These capture the low-frequency component of the signal; in particular, the frequency range [0, 1/2^l].
• Wl,t: The detail component, which consists of the N/2^l wavelet coefficients. These capture the high-frequency component; in particular, the range [1/2^l, 1/2^{l−1}].

The construction starts with V0,t = Xt (W0,t is not defined). At each iteration l = 1, 2, . . . , lg N we perform two operations on Vl−1,t to compute the coefficients at the next level:

• Differencing, to extract the high frequencies:

    Wl,t = (Vl−1,2t − Vl−1,2t−1) / √2

• Smoothing, which averages¹ each consecutive pair of values and extracts the low frequencies:

    Vl,t = (Vl−1,2t + Vl−1,2t−1) / √2

We stop when Wl,t consists of a single coefficient (which happens at l = lg N). The scaling coefficients are needed only during the intermediate stages of the computation. The final wavelet transform is the set of all wavelet coefficients along with Vlg N,0. Starting with Vlg N,0 (which is also referred to as the signal’s scaling coefficient) and following the inverse steps, we can reconstruct each Vl,t until we reach V0,t ≡ Xt.

Figure 2 illustrates the final effect for a signal with N = 16 values. Each wavelet coefficient is the result of projecting the original signal onto the corresponding basis signal. Figure 2 shows the scalogram, that is, the energy (i.e., squared magnitude) of each wavelet coefficient versus the location in time and frequency it is “responsible” for. In general, there are many wavelet transforms, but they all follow the pattern above: a wavelet transform uses a pair of filters, one high-pass and one low-pass.

For our purposes here, we shall restrict ourselves to wavelets of the Daubechies family, which have desirable smoothness properties and successfully compress many real signals. In practice, although by far the most commonly used (largely due to their simplicity), Haar wavelets are too unsmooth and introduce significant artifacting [23]. In fact, unless otherwise specified, we use Daubechies-6.

Incremental wavelets. This part is a very brief overview of how to compute the DWT incrementally. This is the main idea of IncDWT [17], which uses Haar wavelets. In general, when using a wavelet filter of length L, the wavelet coefficient at a particular level is computed using the L corresponding scaling coefficients of the previous level. Recall that L = 2 for Haar (average and difference of two consecutive points), and L = 6 for the Daubechies-6 wavelets that we typically use. Thus, we need to remember the last L − 1 scaling coefficients at each level. We call these the wavelet crest.

¹The scaling factor of 1/√2 in both the difference and averaging operations is present in order to preserve total signal energy (i.e., the sum of squares of all values).

Page 5: Adaptive, Hands-Off Stream Mining - bitquill.netbitquill.net/pdf/stream_vldb03.pdf · Adaptive, Hands-Off Stream Mining ... Pittsburgh, PA, USA {spapadim,christos}@cs.cmu.edu, abrock@stat.cmu.edu

Definition 1 (Wavelet crest) The wavelet crest at time t is defined as the set of scaling coefficients (wavelet smooths) that need to be kept in order to compute the new wavelet coefficients when Xt arrives.

Lemma 1 (DWT update) Updating the wavelet crest requires space (L − 1) lg N + L = O(L lg N) = O(lg N), where L is the width of the wavelet filter (fixed) and N the number of values seen so far.

Proof See [17]. Generalizing to non-Haar wavelets and taking into account the wavelet filter width is straightforward. □

3.2.1 Wavelet properties

In this section we emphasize the DWT properties which are relevant to AWSOM.

Computational complexity. The DWT can be computed in O(N) time and, as new points arrive, it can be updated in O(1) amortized time. This is made possible by the structure of the time/frequency decomposition, which is unique to wavelets. For instance, the Fourier transform also decomposes a signal into frequencies (i.e., a sum of sines), but requires O(N lg N) time to compute and cannot be updated as new points arrive.

Time/frequency decomposition. Notice (see the scalogram in Figure 2) that higher-level coefficients are highly localized in time, but involve uncertainty in frequency, and vice-versa. This is a fundamental trade-off of any time/frequency representation and is a manifestation of the uncertainty principle, according to which localization in frequencies is inversely proportional to localization in time. The wavelet representation is an excellent choice when dealing with semi-infinite streams in limited memory: it “compresses” many real signals well, while it is fast to compute and can be updated online.

Wavelets and decorrelation. A wavelet transform with a length-2L filter can decorrelate only certain signals, provided their L-th order (or lower) backward difference is a stationary random process [23]. For real signals, this value of L is not known in advance and may be impractically large: the space complexity of computing new wavelet coefficients is O(L lg N)—see Lemma 1.

Wavelet variance. One further benefit of using wavelets is that they decompose the variance across scales. Furthermore, the plot of log-power versus scale can be used to detect self-similar components (see Appendix A for a brief overview).

4 Proposed method

In this section we introduce our proposed model. What equations should we be looking for, to replace ARIMA’s (see Equation 1)?

Method           Contin. Streams   Trends/Forecast   Automatic   Memory
DFT (N-point)    NO                NO                —           —
SWFT (N-point)   YES(?)            NO                —           —
DWT (N-point)    NO                NO                —           —
IncDWT [17]      YES               NO                —           —
Sketches [20]    NO                YES(?)            —           —
AR / ARIMA       YES               YES               NO [6]      W²
AWSOM            YES               YES               YES         m|D|²

Table 2: Comparison of methods.


Figure 3: AWSOM—Intuition and demonstration. AWSOM captures intra-scale correlations (a). Also, (b) demonstrates why we fit different models per level.

4.1 Intuition behind our method

First part—information representation. Traditional models (such as ARIMA) operate directly in the time domain. Thus, they cannot deal with redundancies, seasonalities, long-range behavior, etc. This is where a human expert is needed, to manually detect these phenomena and transform the series to match ARIMA’s assumptions.

This is a crucial choice—is there a better one? We want a powerful and flexible representation that can adapt to the sequence, rather than expect someone to adapt the sequence to the representation. We propose to use wavelets because they are extremely successful in compressing most real signals, such as voice and images [13], seismic data [29], biomedical signals [1] and economic time sequences [16]. By using wavelet coefficients, we immediately discard many redundancies (i.e., near-zero valued wavelet coefficients) and focus on what really matters. Furthermore, the DWT can be computed quickly and updated online.

Second part—correlation. In the wavelet domain, how can we capture arbitrary periodicities? A periodic signal will have high-energy wavelet coefficients at the scales that correspond to its frequency. Also, successive coefficients on the same level should have related values (see Figure 3(a)). Thus, in order to capture periodic components, we should look for intra-scale correlations between wavelet coefficients.

How should we capture bursts? Short bursts carry energy in most frequencies. Therefore, wavelet coefficients across scales will carry a large energy (see Figure 14(a)). If the phenomenon follows some pattern, then it is likely that there will be an inter-scale correlation among several of the wavelet coefficients (see Appendix D for more details).

Page 6: Adaptive, Hands-Off Stream Mining - bitquill.netbitquill.net/pdf/stream_vldb03.pdf · Adaptive, Hands-Off Stream Mining ... Pittsburgh, PA, USA {spapadim,christos}@cs.cmu.edu, abrock@stat.cmu.edu

UpdateCrest(X[t]):
    foreach level l ≥ 0 s.t. 2^l divides t:
        compute V[l, t/2^l]
        if 2^(l+1) divides t:
            compute W[l, t/2^(l+1)]
            delete W[l, t/2^(l+1) − L]

Update(X[t]):
    UpdateCrest(X[t])
    foreach new coefficient W[l, t′] in the crest:
        find the linear model it belongs to, based on l and t′ mod T
        update XᵀX and Xᵀy for this model

ModelSelection:
    for each linear model:
        estimate SSR of complete model
        for each subset of regression variables:
            compute SSR of reduced model (Lemma 4)
            estimate probability that the reduction in variance
              is not due to chance
        select the subset of variables with the highest probability
          (or keep all, if not within the 95% confidence interval)

Figure 4: High-level description of the algorithms.
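As a hedged illustration of UpdateCrest, the following Python sketch maintains the crest incrementally for Haar wavelets (L = 2), where the crest reduces to a single pending scaling coefficient per level; the paper's Daubechies-6 version would keep L − 1 = 5 values per level instead. The class name and structure are ours, not the paper's.

    import numpy as np

    class HaarCrest:
        def __init__(self):
            self.pending = []    # pending[l]: unpaired V_{l,.} (or None)

        def update(self, x):
            # Feed one point; return the (level, wavelet coeff) pairs emitted.
            emitted, v, l = [], float(x), 0
            while True:
                if l == len(self.pending):
                    self.pending.append(None)
                if self.pending[l] is None:     # first half of a pair: wait
                    self.pending[l] = v
                    return emitted
                prev, self.pending[l] = self.pending[l], None
                emitted.append((l, (v - prev) / np.sqrt(2.0)))  # differencing
                v = (v + prev) / np.sqrt(2.0)                   # smoothing
                l += 1          # the smooth value propagates one level up

Each emitted coefficient would then be routed to the linear model for its level (and time offset), whose XᵀX and Xᵀy are updated as in Update above.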


The last question we need to answer is: what type of regression models should we use to quantify these correlations? Our proposed method tries to capture inter- and intra-scale correlations by fitting linear regression models in the wavelet domain. These can also be updated online with RLS.

To summarize, we propose using the wavelet representation of the series and capturing correlations in the wavelet domain (see Figure 3(b)).

4.2 AWSOM modeling

In the simplest case, we try to express the wavelet coefficients at each level as a function of the n0 previous coefficients of the same level, i.e.,

    Wl,t = β0,1 Wl,t−1 + β0,2 Wl,t−2 + · · · + β0,n0 Wl,t−n0

where Wl,t are the wavelet coefficients (at time t and level l) and β(l)0,i are the AWSOM coefficients (for level l and lag i). We estimate one set of such coefficients for each level l. This is a model of order n0, denoted as AWSOM(n0).

This can capture arbitrary periodic components, and is sufficient in many real signals. In general, besides within-scale correlations, we may also try to capture across-scale correlations by also including terms of the form βδl,δt Wl+δl, t/2^δl − δt in the above equation. When we use nδl coefficients from each level in the regression models, the order is (n0, n1, . . . , nλ) and the model is denoted by AWSOM(n0, n1, . . . , nλ). See Appendix D for details.

4.3 Model selection

Many of the dependences may be statistically insignificant, so the respective coefficients should be set to zero. We want to (a) avoid over-fitting and (b) present to the user those patterns that are important. We can do model selection and combination—technical details are in Appendix C. This can be performed using only data gathered online, and the time complexity is independent of the stream size. The only thing that needs to be decided in advance is the largest AWSOM(n0, . . . , nλ) order we may want to fit. From the data collected, we can then automatically select

Dataset      Size   Description
Triangle     64K    Triangle wave (period 256)
Mix          256K   Square wave (period 256) plus sine (period 64)
Sunspot      2K     Sunspot data
Automobile   32K    Automobile traffic sensor trace from a large
                    Midwestern state

Table 3: Description of datasets (sizes are in number of points, 1K = 1024 points).

any model of smaller order AWSOM(n′0, . . . , n′λ′), where λ′ ≤ λ and n′i ≤ ni.

4.4 Complexity

In this section we show that our proposed AWSOM models can be easily estimated with a single-pass, “any-time” algorithm. From Lemma 1, estimating the new wavelet coefficients requires space O(lg N). In fact, since we typically use Daubechies-6 wavelets (L = 6), we need to keep exactly 5 lg N + 6 values. The AWSOM models can be dynamically updated using RLS.

The number of active levels La depends on N, while the value of T is fixed and depends only on the depth λ. In particular:

    T ∼ 2^λ   and   La ∼ lg(N / (Na·T)) = lg N − lg Na − λ

This leads to the following result:

Lemma 2 (Logarithmic space complexity) Maintaining the model requires O(lg N + mk²) space, where N is the length of the signal so far, k is the total AWSOM order, and m = La·T is the number of equations.

Proof Keeping the wavelet crest scaling coefficients requires space O(lg N). If we use recursive least squares, we need to maintain a k × k matrix for each of the m equations in the model. □

Auto-regressive models with a comparable window size need space O(m²k²), since the equivalent fair window size is W ≈ mk. Here, “fair” means that the total number of AWSOM coefficients plus the number of initial conditions we need to store is the same for both methods. This is the information that comprises the data synopsis and that would eventually have

[Figure 5 plots omitted: (a) normalized memory requirements of AWSOM vs. AR(fair) on Triangle, Mix, Imptrain, Sunspot and Automobile; (b) time vs. stream size, actual and linear fit (correlation 0.99995).]

Figure 5: (a) Memory space requirements (normalized): space needed to keep the models up-to-date (AWSOM and AR with equivalent, fair window size). (b) Time complexity versus stream size (Python prototype), including model selection; the relationship is exactly linear, as expected.

to be eventually communicated. However, the device gathering the measurements needs extra storage space in order to update the models. The latter is, in fact, much larger for AR than for AWSOM (see Figure 5(a)). Thus, this definition of equivalent window actually favors AR.

Lemma 3 (Time complexity) Updating the model when a new data point arrives requires O(k²) time on average, where k is the number of AWSOM coefficients in each equation.

Proof On average, the wavelet crest scaling coefficients can be updated in O(1) amortized time. Although a single step may require O(lg N) time in the worst case, the amortized time required is O((Σ_{i=0..N} B(i))/N) = O(1), where B(i) is the number of trailing zeros in the binary representation of i.² Updating the k × k matrix for the appropriate linear equation (which can be identified in O(1), based on level l and on t mod T) requires time O(k²). □

Auto-regressive models with a comparable window size need O(m²k²) time per update.

Corollary 1 (Constant-time update) When the model parameters have been fixed (typically k is a small constant ≈ 10 and m ∼ lg N), the model requires space O(lg N) and amortized time O(1) for each update.

Figure 5(b) shows that this is clearly the case, as expected.

5 Experimental evaluation

We compared AWSOM against standard AR (with the equivalent, fair window size—see Section 4.4), as well as hand-tuned (S)ARIMA (wherever possible). Our prototype AWSOM implementation is written in Python, using Numeric Python for fast array manipulation. We used the standard ts package from R (version 1.6.0—see http://www.r-project.org/) for the AR and (S)ARIMA models. We illustrate the properties of AWSOM and how to interpret the models using synthetic datasets, and then show how these apply to real datasets (see Table 3).

²Seen differently, IncDWT is essentially a pre-order traversal of the wavelet coefficient tree.

[Figure 6 plots omitted: wavelet variance (power) vs. level. Automobile: Corr = −0.502, Slope = −0.157. Sunspot: Corr = +0.305, Slope = 0.101.]

Figure 6: Wavelet variance diagnostic. Automobile exhibits self-similarity in scales up to 6 (which roughly corresponds to one hour), but not overall.

[Figure 7 plots omitted: Triangle with AWSOM(4), AR(72) and SAR ×(1) 256 (failed); Mix with AWSOM(4), AR(58) and SAR(1)×(1) 256 (failed).]

Figure 7: Forecasts—synthetic datasets. Note that AR gives the wrong trend (if any), while seasonal AR fails to complete.

Only the first half of each sequence was used to estimate the models, which were then applied to generate a sequence of length equal to that of the entire second half. For AR and (S)ARIMA, the last values of the first half (as dictated by the lags) were used to initiate generation. For AWSOM, we again used as many of the last wavelet coefficients from each DWT level of the first half as were necessary to start applying the model equations. We should note that generating more than, say, 10 steps ahead is very rare: most methods in the literature [25] generate one step ahead, then obtain the correct value of Xt+1, and only then try to generate Xt+2. Nevertheless, our goal is to capture long-term behavior, and AWSOM achieves this efficiently, unlike ARIMA.
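To make the protocol concrete, the sketch below (our illustration, reusing the hypothetical fit_ar coefficients from the Section 3.1 sketch) generates a long-range continuation by feeding each forecast back as input, never consulting the true second half:

    import numpy as np

    def generate(history, phi, steps):
        # Long-range generation: each forecast is fed back as an input.
        buf = list(np.asarray(history[-len(phi):], dtype=float))
        out = []
        for _ in range(steps):
            nxt = float(np.dot(phi, np.asarray(buf[::-1])))
            out.append(nxt)
            buf = (buf + [nxt])[-len(phi):]
        return out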

5.1 Interpreting the models

Visual inspection. A “forecast” is essentially a by-product of any generative time series model: application of any model to generate a number of “future” values reveals precisely the trends and patterns captured by that model. In other words, synthesizing points based on the model is the simplest way for any user to get a quick, yet fairly accurate, idea of what the trends are or, more precisely, what the model thinks they are.


Thus, what we expect to see (especially in a long-range forecast) is the important patterns that can be identified from the real data.

However, an expert user can extract even more precise information from the models. We will now explain how the “AWSOM language” can be fully interpreted.

Variance test. As explained in Appendix A, if the signal is self-similar, then the plot of log-power versus scale is linear.

Definition 2 (Variance diagnostic) The log-power vs. scale plot is the wavelet variance diagnostic plot (or just variance diagnostic). In particular, the correlation coefficient ρα quantifies the relation. If the plot is linear (in a range of scales), the slope α̂ is the self-similarity exponent (−1 < α < 0; the closer to zero, the more bursty the series).

A large value of |ρα|, at least across several scales, indicates that the series component in those scales may be modeled using a fractional noise process with parameter dictated by α (see Automobile). However, we should otherwise be careful in drawing further conclusions about the behavior within these scales.

We should note that, after the observation by [21], fractional noise processes and, in general, self-similar sequences have revolutionized network traffic modeling. Furthermore, self-similar sequences appear in atomic clock fluctuations, river minima and compressed video bit-rates [3, 23], to mention a few examples.

Wavelet variance (energy and power). The magnitude of the variance within each scale serves as an indicator of which frequency components are the dominant ones in the sequence. To precisely interpret the results, we also need to take into account the fundamental uncertainty in frequencies (see Figure 13). However, the wavelet variance plot quickly gives us the general picture of important trends. Furthermore, it guides us to focus on AWSOM coefficients around frequencies with large variance.

AWSOM coefficients. Regardless of the energy within a scale, the AWSOM coefficients provide further information about the presence of trends, which cannot be deduced from the variance plots. In particular: (a) Large intra-scale coefficients: These capture patterns at certain frequencies, regardless of their energy presence. Furthermore, if the coefficients are not the same for all regression models at the same level, this is an indication of “seasonalities” within that scale and captures a different type of information about lower frequencies. (b) Large inter-scale coefficients: These occur when there are repeated bursts (see also Appendix D). The number of scales with large inter-scale coefficients depends on burst duration (short bursts have large bandwidth).

To summarize, the steps are: (1) Examine the variance diagnostic to identify sub-bands that correspond to a self-similar component. These may be modeled using a fractional noise process for generation purposes; for forecasting purposes they are just that: noise. (2) Examine the wavelet energy spectrum to quickly identify important sub-bands. (3) Examine the AWSOM coefficients, primarily within and around the sub-bands identified during the second step.

[Figure 9 plots omitted: Automobile forecast using AWSOM(3) plus fractional noise, and the corresponding Q-Q plot: AWSOM (0.81/0.99), AR (0.02/0.37).]

Figure 9: Automobile—generation with fractional noise.

5.2 Synthetic datasets

We present synthetic datasets to illustrate the basic properties of AWSOM, its behavior on several characteristic classes of sequences, and the principles behind interpreting the models. Applying the models to generate a number of “future” data points is the quickest way to see if each method captures long-term patterns.

Triangle. AR fails to capture anything, because the window is not large enough. SAR estimation (with no differencing, no MA component and only a manually pre-specified lag-256 seasonal component) fails completely. In fact, R segfaults after several minutes, even without using maximum-likelihood estimation (MLE). However, AWSOM captures the periodicity. The model visualization is omitted due to space constraints—see the discussion on Mix.

Mix. AR is again confused and does not capture even the sinusoidal component. SAR estimation (without MLE) fails (R’s optimizer returns an error, after several minutes of computation). Figure 12 shows the AWSOM coefficients. Due to space constraints, we show only the levels that correspond to significant variance. These illustrate the first point in the interpretation of AWSOM coefficients. We clearly see strong correlations in levels 6 and 8 (which correspond to the periods 2^6 = 64 and 2^8 = 256 of the series components). Note that the variance alone (see also Figure 13) is not enough to convey this information.

We also tried SAR(0)×(1)128 on an impulse train of period 128. On a 1024-point sequence, R takes over 4 minutes (and on a signal with 64K points it did not complete in over one hour). However, AWSOM estimates the parameters (with 64K points) in approximately 50 seconds, even though our prototype is implemented in Python.

5.3 Real datasets

Due to space constraints, we omit the AWSOM coefficient visualizations. For the real datasets, we show the marginal distribution quantile-quantile plots (or Q-Q plots—see Figure 10 and Figure 9).³


[Figure 8 plots omitted: Sunspot original, AWSOM(6,1), AR(60) and SARIMA(2,1,0)×(1,1,0) 120; Automobile original, AWSOM(3), AR(36) and SAR (failed).]

Figure 8: Forecasts—real datasets. AR fails to detect any trend, while seasonal AR fails to complete or gives a wrong conclusion in 260× the time.


Sunspot. This is a well-known dataset with a time-varying “period.” AR again fails completely. SAR (without an MA component, much less MLE) takes 40 minutes to estimate. AWSOM (in Python) takes less than 9 seconds. SAR gives a completely fixed period and misses the marginal distribution (see Figure 10). AWSOM captures the general periodic trend, with a desirable slight confusion about the “period.”

Automobile. This dataset has a strongly linear variance diagnostic in scales 1–6 (Figure 6). However, the lower frequencies contain the most energy (see Figure 11). This indicates that we should focus on these scales. The lowest frequency corresponds to a daily periodicity (approximately 4000 points per day, or about 8 periods in the entire series) and the next highest frequency corresponds to the morning and afternoon rush-hours.

In this series, the high frequencies can be modeled by fractional noise. Figure 9 shows a generated sequence with fractional noise, as identified by AWSOM. The fractional difference parameter is estimated as δ̂ ≡ −α̂/2 ≈ 0.276 and the amplitude is chosen to match the total variance in those scales.

However, for unsupervised outlier detection, this is not necessary: what would really constitute an outlier would be, for instance, days that (a) do not follow the daily and rush-hour patterns, or (b) whose variance in the fractional noise scales is very different. This can be captured automatically by the series components in the appropriate frequency sub-bands, which AWSOM identifies as a periodic component and bursty noise, respectively.

³These are the scatter plots of (x, y) such that p% of the values are below x in the real sequence and below y in the generated sequence. When the distributions are identical, the Q-Q plot coincides with the bisector of the first quadrant.
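Computing the Q-Q points of the footnote is straightforward; below is a minimal numpy sketch (function name ours):

    import numpy as np

    def qq_points(real, generated, n=100):
        # For each probability p, the value below which a fraction p of
        # each sequence lies; identical marginals give the line y = x.
        p = np.linspace(0.0, 1.0, n)
        return np.quantile(real, p), np.quantile(generated, p)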

[Figure 10 plots omitted: Sunspot Q-Q: AWSOM (0.94/0.97), AR (0.05/0.43), SARIMA (0.67/0.97); Automobile Q-Q: AWSOM (0.47/0.99), AR (0.02/0.37).]

Figure 10: Marginal Q-Q plots (slope and correlation coefficients in parentheses).

6 Conclusions

Sensor networks are becoming increasingly popular, thanks to falling prices and increasing storage and processing power. We presented AWSOM, which achieves all of the following goals: (1) Unsupervised operation: once we decide the largest AWSOM order, no further intervention is needed; the sensor can be left alone to collect information. (2) ‘Any-time’, one-pass algorithm to incrementally update the patterns. (3) Automatic detection of arbitrary periodic components. (4) Limited memory, O(lg N). (5) Simplicity: AWSOM provides linear models. (6) Power: AWSOM provides information across several frequencies and can diagnose self-similarity and long-range dependence. (7) Immediate outlier detection: our method, despite its simplicity and its unsupervised operation, is able to do forecasting (directly, for the estimated model). We showed real and synthetic data, where our method captures the periodicities and burstiness, while manually selected AR (and even its (S)ARIMA generalizations, which are not suitable for streams with limited resources) fails completely.

AWSOM is an important first step toward hands-off stream mining, combining simplicity with modeling power. Continuous queries are useful for evidence gathering and hypothesis testing once we know what we are looking for. AWSOM is the first method to deal directly with the problem of unsupervised stream mining and pattern detection, and it fills this gap.


[Figure 11 plots omitted: wavelet variance vs. level for Tri256, Mix00, Imptrain256, Sunspot0 and Auto0.]

Figure 11: Wavelet variances.

[Figure 12 diagrams omitted. Mix: L5(0) [43.31]: −0.811, −0.195, +0.195, −0.163; L6(0) [67.23]: +0.974; L7(0) [242.9]: −0.243, +0.244, −0.243, +0.243; L8(0) [4.610]: +0.974; L9(0) [0.000]: blank. Impulses: L1(0) through L4(0) [24.95–24.97]: +1.000; L5(0) [24.98]: +0.500, +0.500; L1(1) through L5(1) [0.000]: blank.]

Figure 12: Mix and Impulses—AWSOM models. The bars on the left are proportional to the variance (see also Figure 11) and the numbers in brackets show the variance of the wavelet coefficients that correspond to each AWSOM equation. Blank boxes correspond to coefficients excluded by the model selection process.


Acknowledgments. We thank Becky Buchheit for her help with the automobile traffic datasets.

References

[1] M. Akay, editor. Time Frequency and Wavelets in Biomedical Signal Processing. J. Wiley, 1997.

[2] A. Arasu, B. Babcock, S. Babu, J. McAlister, and J. Widom. Characterizing memory requirements for queries over continuous data streams. In PODS, 2002.

[3] J. Beran. Statistics for Long-Memory Processes. Chapman & Hall, 1994.

[4] T. Bollerslev. Generalized autoregressive conditional heteroskedasticity. J. Econometrics, 31:307–327, 1986.

[5] P. Bonnet, J. E. Gehrke, and P. Seshadri. Towards sensor database systems. In Proc. MDM, 2001.

[6] P. J. Brockwell and R. A. Davis. Time Series: Theory and Methods. Springer Series in Statistics. Springer-Verlag, 2nd edition, 1991.

[7] A. Bulut and A. K. Singh. SWAT: Hierarchical stream summarization in large networks. In Proc. 19th ICDE, 2003.

[8] L. R. Carley, G. R. Ganger, and D. Nagle. MEMS-based integrated-circuit mass-storage systems. CACM, 43(11):72–80, 2000.

[9] D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. B. Zdonik. Monitoring streams – a new class of data management applications. In Proc. VLDB, 2002.

[10] Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-dimensional regression analysis of time-series data streams. In Proc. VLDB, 2002.

[11] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. SODA, 2002.

[12] A. Dobra, M. N. Garofalakis, J. Gehrke, and R. Rastogi. Processing complex aggregate queries over data streams. In Proc. SIGMOD, 2002.

[13] C. Faloutsos. Searching Multimedia Databases by Content. Kluwer Academic Inc., 1996.

[14] M. N. Garofalakis and P. B. Gibbons. Wavelet synopses with error guarantees. In Proc. SIGMOD, 2002.

[15] J. Gehrke, F. Korn, and D. Srivastava. On computing correlated aggregates over continual data streams. In Proc. SIGMOD, 2001.

[16] R. Gencay, F. Selcuk, and B. Whitcher. An Introduction to Wavelets and Other Filtering Methods in Finance and Economics. Academic Press, 2001.

[17] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proc. VLDB, 2001.

[18] S. Guha and N. Koudas. Approximating a data stream for querying and estimation: Algorithms and performance evaluation. In Proc. ICDE, 2002.

[19] J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. Culler, and K. Pister. System architecture directions for networked sensors. In Proc. ASPLOS-IX, 2000.

[20] P. Indyk, N. Koudas, and S. Muthukrishnan. Identifying representative trends in massive time series data sets using sketches. In Proc. VLDB, 2000.

[21] W. Leland, M. Taqqu, W. Willinger, and D. Wilson. On the self-similar nature of ethernet traffic. IEEE Trans. on Networking, 2(1):1–15, 1994.

[22] S. R. Madden, M. A. Shah, J. M. Hellerstein, and V. Raman. Continuously adaptive continuous queries over streams. In SIGMOD Conf., 2002.

[23] D. B. Percival and A. T. Walden. Wavelet Methods for Time Series Analysis. Cambridge Univ. Press, 2000.

[24] E. Riedel, C. Faloutsos, G. R. Ganger, and D. Nagle. Data mining on an OLTP system (nearly) for free. In SIGMOD Conf., 2000.

[25] A. S. Weigend and N. A. Gerschenfeld. Time Series Prediction: Forecasting the Future and Understanding the Past. Addison Wesley, 1994.

Page 11: Adaptive, Hands-Off Stream Mining - bitquill.netbitquill.net/pdf/stream_vldb03.pdf · Adaptive, Hands-Off Stream Mining ... Pittsburgh, PA, USA {spapadim,christos}@cs.cmu.edu, abrock@stat.cmu.edu

[Figure 13 plots omitted: Daubechies-6 cascade gain vs. frequency for levels W3, W4 and W5.]

Figure 13: Daubechies-6 cascade gain (levels 3–5).

[26] B.-K. Yi, N. Sidiropoulos, T. Johnson, H. Jagadish, C. Faloutsos, and A. Biliris. Online data mining for co-evolving time sequences. In Proc. ICDE, 2000.

[27] P. Young. Recursive Estimation and Time-Series Analysis: An Introduction. Springer-Verlag, 1984.

[28] D. Zhang, D. Gunopulos, V. J. Tsotras, and B. Seeger. Temporal aggregation over data streams using multiple granularities. In Proc. EDBT, 2002.

[29] R. Zuidwijk and P. de Zeeuw. Fast algorithm for directional time-scale analysis using wavelets. In Proc. SPIE, Wavelet Applications in Signal and Image Processing VI, volume 3458, 1998.

A More wavelet properties

Frequency properties. Wavelet filters employed in practice can only approximate an ideal bandpass filter, since they are of finite length L. The practical implication is that wavelet coefficients at level l correspond roughly to frequencies [1/2^{l+1}, 1/2^l] (or, equivalently, periods [2^l, 2^{l+1}]; see Figure 13 for the actual correspondence). This has to be taken into account for precise interpretation of AWSOM models by an expert.

Wavelet variance and self-similarity. The wavelet variance decomposes the variance of a sequence across scales. Due to space limitations, we mention only basic definitions and facts; details can be found in [23].

Definition 3 (Wavelet variance) If {Wl,t} is the DWT of a series {Xt}, then the wavelet variance Vl is defined as Vl = Var[Wl,t].

Under certain general conditions,

    V̂l = (2^l / N) Σ_{t=1..N/2^l} W²l,t

is an unbiased estimator of Vl. Note that the sum is precisely the energy of {Xt} at scale l.

Definition 4 (Self-similar sequence) A sequence {Xt} is said to be self-similar following a pure power-law process if SX(f) ∝ |f|^α, where −1 < α < 0 and SX(f) is the SDF.⁴

It can be shown that Vl ≈ 2 ∫ SX(f) df over f ∈ [1/2^{l+1}, 1/2^l]; thus, if {Xt} is self-similar, then log Vl ∝ l, i.e., the plot of log Vl versus the level l should be linear. In fact, the slope of the log-power versus scale plot should be approximately equal to the exponent α. This fact, and how to estimate Vl, are what the reader needs to keep in mind.

⁴The spectral density function (SDF) is the Fourier transform of the auto-covariance sequence (ACVS) SX,k ≡ Cov[Xt, Xt−k]. Intuitively, it decomposes the variance into frequencies.
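A minimal sketch of the variance diagnostic (our code, reusing the per-level coefficients from the haar_dwt sketch of Section 3.2): estimate V̂l as the mean squared wavelet coefficient per level, which equals (2^l/N) times the energy at scale l, and fit log V̂l against l.

    import numpy as np

    def variance_diagnostic(levels):
        # levels[l]: array of wavelet coefficients at level l + 1.
        V = np.array([np.mean(np.square(w)) for w in levels if len(w) > 0])
        l = np.arange(1, len(V) + 1, dtype=float)
        logV = np.log2(V)
        slope = np.polyfit(l, logV, 1)[0]     # ~ alpha if self-similar
        corr = np.corrcoef(l, logV)[0, 1]     # rho_alpha
        return slope, corr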

B Recursive Least Squares (RLS)

The least squares solution to an overdetermined system of equations Xb = y, where X ∈ R^{m×k} (measurements), y ∈ R^m (output variables) and b ∈ R^k (regression coefficients to be estimated), is given by the solution of XᵀXb = Xᵀy. Thus, all we need for the solution are the projections

    P ≡ XᵀX   and   q ≡ Xᵀy

We need only space O(k² + k) = O(k²) to keep the model up to date. When a new row x_{m+1} ∈ R^k and output y_{m+1} arrive, we can update P ← P + x_{m+1}x_{m+1}ᵀ and q ← q + y_{m+1}x_{m+1}. In fact, it is possible to update the regression coefficient vector b without explicitly inverting P to solve Pb = q (see [27]).

C Model selection

We show how feature selection can be done from the data gathered online (i.e., P and q for each AWSOM equation).

C.1 Model testing

Lemma 4 (Square sum of residuals) If b is the least-squares solution to the overdetermined equation Xb = y, then

    sn ≡ Σ_{i=1..n} (xiᵀb − yi)² = bᵀPb − 2bᵀq + y²

where y² denotes the running sum Σi yi².

Proof Straightforward from the definition of sn, which in matrix form is sn = ∥Xb − y∥². □

Thus, besides P and q, we only need to update y² (a single number), by adding yi² to it as each new value arrives. Now, if we select a subset I = {i1, i2, . . . , ip} ⊆ {1, 2, . . . , k} of the k variables x1, x2, . . . , xk, then the solution bI for this subset is given by PIbI = qI and the SSR by sn = bIᵀPIbI − 2bIᵀqI + y², where the subscript I denotes straight row/column selection (e.g., PI = [pij,ik]ij,ik∈I).

The F-test (Fisher test) is a standard method for determining whether a reduction in variance is statistically significant. The F-test is based on the sample variances, which can be computed directly from the SSR (Lemma 4). Although the F-test holds precisely (i.e., non-asymptotically) under normality assumptions, in practice it works well in several circumstances, especially when the population size is large. This is clearly the case with semi-infinite streams.
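The following sketch (ours) computes the reduced-model SSR from the online statistics of the hypothetical RLSModel above, as in Lemma 4, and obtains the F-test p-value via scipy; the degrees of freedom follow the standard nested-model F-test.

    import numpy as np
    from scipy import stats

    def ssr_subset(m, I):
        # SSR of the model restricted to the variable subset I (Lemma 4).
        I = np.asarray(I)
        P_I, q_I = m.P[np.ix_(I, I)], m.q[I]
        b_I, *_ = np.linalg.lstsq(P_I, q_I, rcond=None)
        return float(b_I @ P_I @ b_I - 2.0 * (b_I @ q_I) + m.y2)

    def f_test_pvalue(ssr_full, k_full, ssr_red, k_red, n):
        # Is the SSR increase from dropping variables due to chance?
        F = ((ssr_red - ssr_full) / (k_full - k_red)) \
            / (ssr_full / (n - k_full))
        return stats.f.sf(F, k_full - k_red, n - k_full)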

C.2 Model combination

If we split the measurements xi into two subsets X1 and X2, with corresponding outputs y1 and y2, then the LS solution for both subsets combined is given by b = (XᵀX)⁻¹Xᵀy, where X = [X1ᵀ X2ᵀ]ᵀ and y = [y1ᵀ y2ᵀ]ᵀ, i.e., b = (X1ᵀX1 + X2ᵀX2)⁻¹(X1ᵀy1 + X2ᵀy2) = (P1 + P2)⁻¹(q1 + q2). Therefore, it is possible to combine sub-models when reducing the number of levels (effectively reducing T ≡ 2^λ). Model selection as presented above can be extended to include this case.
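In code, combining sub-models is just a sum of their projections before solving (a sketch, reusing the hypothetical RLSModel from Appendix B):

    import numpy as np

    def combine(models):
        # b = (P_1 + P_2 + ...)^{-1} (q_1 + q_2 + ...)
        P = sum(m.P for m in models)
        q = sum(m.q for m in models)
        b, *_ = np.linalg.lstsq(P, q, rcond=None)
        return b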



Figure 14: (a) Inter-scale correlations, intuition. (b, c) Illustration of AWSOM(1, 1) with Li = 2 inactive levels. The shade of each wavelet coefficient corresponds to the model equation used to “predict” it. The unshaded wavelet coefficients correspond to initial conditions (i.e., with incomplete AWSOM window D).


D Full model / Inter-scale correlations

Formally, our proposed method tries to fit models of the following form:

    Wl,t = Σ_{(δl,δt)∈D} βδl,δt · Wl+δl, t/2^δl − δt + εl,t    (2)

where D is a set of index offsets and εl,t is the usual error term. For example, in Figure 14(b), D = {(0, 1), (0, 2), (1, 0)} and Wl,t = β0,1 Wl,t−1 + β0,2 Wl,t−2 + β1,0 Wl+1,t/2. The βδl,δt are called the AWSOM coefficients.

Definition 5 (AWSOM order) The set of offsets is always of the form

    D = { (0, 1), (0, 2), . . . , (0, n0),
          (1, 0), (1, 1), . . . , (1, n1 − 1),
          . . . ,
          (λ, 0), . . . , (λ, nλ − 1) }

i.e., each wavelet coefficient is expressed as a function of the previous n0 wavelet coefficients on the same level, n1 coefficients from one level below, and so on. For a particular choice of D, we use AWSOM(n0, n1, . . . , nλ) to denote this instance of our model. We call (n0, . . . , nλ) the model’s order. The total order is the number of AWSOM coefficients k per equation, i.e., k = Σ_{δl=0..λ} nδl, and λ is called the depth of the model.

For example, Figure 14(b) shows an AWSOM(2, 1) model. A fixed choice of D is sufficient for all signals. In most of our experiments we have used AWSOM(6, 4, 2) (total order k = 12).
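A small sketch (ours) expanding an AWSOM order into the offset set D of Definition 5:

    def awsom_offsets(order):
        # order = (n0, n1, ..., n_lambda); same-level lags start at (0, 1),
        # each deeper level dl >= 1 contributes (dl, 0), ..., (dl, n_dl - 1).
        D = [(0, dt) for dt in range(1, order[0] + 1)]
        for dl, n in enumerate(order[1:], start=1):
            D += [(dl, dt) for dt in range(n)]
        return D

    # awsom_offsets((2, 1)) == [(0, 1), (0, 2), (1, 0)]
    # which is the Figure 14(b) example D.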

[Figure 15 plots omitted: Impulses with AWSOM(4,2), AR(70) and SAR ×(1) 256 (failed).]

Figure 15: Forecasts—Impulses. The signal consists of an impulse train (one impulse every 256 points), for a total of 64K points.

Furthermore, we fit one equation per level (see Figure 14(b)), as long as the level contains enough wavelet coefficients to get a good fit. Thus, we fit one equation for every level l < La. These are the active levels, where La is the first level that has no more than, say, Na = 16 wavelet coefficients. For levels l ≥ La (the inactive levels), we can either keep the exact wavelet coefficients (which would be no more than 16 + 8 + · · · + 1 = 31 in the above case) and/or fit one more equation.

In other words, the number of inactive levels Li is always fixed (to, say, Li = 4) and the number of active levels La grows as the stream size increases.

When fitting an AWSOM model with depth λ ≥ 1, we also fit different equations depending on the time location t. For instance, if we are using AWSOM(n0, n1), we should fit one equation for the pairs Wl,2t and Wl−1,t, and another for the pairs Wl,2t+1 and Wl−1,t (see Figure 14(c)). In general, we need 2^λ separate models to ensure that the inter-scale correlations λ levels down are not “shoehorned” into the same regression model.

To summarize, the full AWSOM model fits a number of equations:

    Wl,t = Σ_{(δl,δt)∈D} β(l′,t′)δl,δt · Wl+δl, t/2^δl − δt + εl,t    (3)

for l′ ≤ La and t′ ≡ t mod T, 0 ≤ t′ < T. For example, if T = 2, we estimate one linear equation for each set of wavelet coefficients W0,2i, W0,2i+1, Wl,2i and Wl,2i+1 (l ≥ 1, i ≥ 0). The significant advantage of this approach is that we can still easily update the AWSOM equations online, as new data values arrive. This is possible because the equation is selected based only on l and t for the new wavelet coefficient.

Impulses. This synthetic dataset (see Figure 15) illustrates how inter-scale correlations may help. AR fails to capture anything (again, the window is too small) and SAR estimation fails, while AWSOM captures the overall behavior. Figure 12 illustrates the second point in the interpretation of AWSOM coefficients. We clearly see the repeated presence of bursts, with strong inter-scale correlations across all levels up to the impulse “period” (since the bursts have width one). Due to space constraints, we show only those levels that correspond to the bursts. At level 5, information from the impulse “period” begins to enter the wavelet coefficients (see also Figure 13). After level 7, the inter-scale correlations diminish in significance and the interpretation is similar to that for Mix.

and Wl,2k+1 (l ≥ 1, i ≥ 0). The significant advantageof this approach is that we can still easily update theAWSOM equations online, as new data values arrive.This is possible because the equation is selected basedonly on l and t for the new wavelet coefficient.Impulses. This synthetic dataset (see Figure 15) il-lustrates how inter-scale correlations may help. ARfails to capture anything (again, too small window)and SAR estimation fails, while AWSOM capturesthe overall behavior. Figure 12 illustrates the secondpoint in the interpretation of AWSOM coefficients. Weclearly see repeated presence of bursts, with stronginter-scale correlations across all levels up to the im-pulse “period” (since the bursts have width one). Dueto space constraints we show only those levels that cor-respond to the bursts. At level 5, information from theimpulse “period” begins to enter in the wavelet coeffi-cients (see also Figure 13). After level 7, the inter-scalecorrelations diminish in significance and the interpre-tation is similar to that for Mix.