Continuous Queries —The Streaming Time Series Case

X. Sean Wang
Department of Information and Software Engineering
George Mason University
[email protected]

Ph.D. Students: Like Gao, Zhengrong Yao

X. Sean Wang, GMU © 2002

Outline

- Continuous queries in general
  - Motivational applications
  - Technical challenges
  - Known projects
- Streaming time series case
  - Applications
  - Two scenarios
  - Algorithms & experimental results
- Conclusions


Applications

- Network management
  - E.g., IP packets
  - Network performance measurement
  - Network monitoring
  - Traffic management
- Business applications
  - Financial market: stock tickers
  - Real-time inventory monitoring
- Data warehouse: (1) extract relevant data from operational data and (2) monitor aggregated data


Applications (Continued)

- Web monitoring
  - Click stream in web tracking
- News feeds (e.g., Reuters)
  - Stream of news pieces
  - Subscriptions
- Sensor network (upcoming!)
  - Data feed from sensors (check out UC Berkeley TinyOS and related projects)
- Telecommunications
  - Phone calls
  - Location-based real-time applications(!)


Applications (Continued)

- Computer infrastructure intrusion detection
  - Not just network management
  - Application-level monitoring
- Scientific data processing
  - Data sit on mass storage devices, and random access is not supported
  - A sequential scan is like looking at stream data
  - A little different, because query results may not need to be a stream


Applications (Continued)

- In general
  - Queries are mostly of the “old” kinds, but with a “monitoring” flavor
  - Queries are issued once and answered many times
  - Potentially many queries
- Technically, two issues
  - Incremental evaluation & multi-query optimization
- Different from “triggers” in a DBMS
  - Trigger system: “naïve” implementation
  - Not optimized for fast data rates or a large number of triggers


Current Projects

- NiagaraCQ: U. Wisconsin Madison
- STREAM project: Stanford U.
- Fjord: UC Berkeley
- Aurora: Brown U. & Stonebraker
- TameCQ: GMU (Tracking and Monitoring Engine based on Continuous Queries)


Related Projects & Research Subjects

- Xyleme: INRIA project, subscription
- Stream data model: UCLA
- On-line query processing
- Content-based network routing
- Temporal & spatial data (especially moving objects)
- Materialized views (incremental updates)


Time Series Case —

Similarity-Based Continuous Queries on Time Series


Continuous queries on streaming time series

[Figure: sensors continuously send in patient data, forming the streaming time series S*. An ECG database contains m+1 disease signatures (patterns). The querying time series are windows of S labeled L0+1, L1+1, …, Lm+1, and the nearest neighbor of S is found at each time position.]

* Creighton University Ventricular Tachyarrhythmia Database: http://www.physionet.org/physiobank/database/cudb/


Similarity-based Queries on time series

- Given a set of pattern time series and a query time series TS, find the nearest neighbor of TS (may also be k-nearest, h-near, …).
- Related work
  - Whole match queries: LQ = LP
    - R. Agrawal et al. [FODO '93]
    - D. Rafiei and A. O. Mendelzon [TKDE '00]
  - Subsequence match queries: LQ < LP
    - C. Faloutsos et al. [SIGMOD '94]: fixed / variable length query series
    - T. Kahveci & A. K. Singh [ICDE '01]: variable length query series
- Our problem
  - Continuous queries …
  - For the querying time series: sliding windows with variable lengths based on the patterns: LQ0 = LP0, LQ1 = LP1, …, LQm = LPm


Streaming time series and its neighbors

- A streaming time series S is an infinite sequence of real numbers whose values arrive at the query processing system sequentially.
- Nearest neighbor of S at time position p (labeled NNp in the figure).
- Nearest neighbors of S: the sequence of the nearest neighbors of the streaming series S at all positions.

[Figure labels: Lmax, Lmin, NNp]
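The definitions above can be illustrated with a minimal sequential-scan sketch. It is a baseline only, not the algorithm of this talk. The function names are hypothetical, and dividing the squared distance by the pattern length is just one assumed way to weight distances across patterns of different lengths.

```python
import math

def weighted_distance(window, pattern):
    """Euclidean distance scaled by 1/sqrt(length) (an assumed weighting)."""
    sq = sum((a - b) ** 2 for a, b in zip(window, pattern))
    return math.sqrt(sq / len(pattern))

def nearest_neighbor_at(stream, p, patterns):
    """Return (pattern index, distance) for the windows of S ending at position p."""
    best = None
    for i, pat in enumerate(patterns):
        n = len(pat)
        if p + 1 < n:                      # not enough history for this pattern yet
            continue
        window = stream[p - n + 1 : p + 1]  # window length follows the pattern length
        d = weighted_distance(window, pat)
        if best is None or d < best[1]:
            best = (i, d)
    return best

stream = [0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0]
patterns = [[3.0, 2.0, 1.0], [0.0, 1.0, 2.0, 3.0], [5.0, 5.0]]
# The "nearest neighbors of S": one answer per time position.
neighbors = [nearest_neighbor_at(stream, p, patterns) for p in range(len(stream))]
```

Every position repeats the full distance computation for every pattern, which is the cost that the batch process and prediction are meant to reduce.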


What’s our problem?

- Challenges
  - Nearest neighbor queries are computationally costly.
  - Fast processing is necessary.
- Assumptions of this paper
  - The number of patterns is small, and all patterns have already been loaded into main memory.
  - The pattern set consists of time series with variable lengths.
  - Weighted Euclidean distance is the similarity measure.
  - Computational cost, i.e. CPU time, is the dominant factor.
- Performance goal
  - Fast processing.


Our solution

- CQP algorithm
  - Batch process
  - Prediction


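Batch Process (1)

- Find the distances from each pattern to the streaming series at multiple positions in one batch process, instead of doing so separately.

[Figure: one batch process. Pattern Fi (LFi = 100) is aligned against S at the windows p-99..p, p-98..p+1, p-97..p+2, p-96..p+3, …, p-92..p+7.]

Un-normalized square distance from S to Fi, D2(S, Fi), at position p:

\[
\begin{aligned}
D^2(S, F_i) &= \sum_{m=0}^{99} \{ S[(p-99)+m] - F_i[m] \}^2 \\
            &= \sum_{m=0}^{99} S^2[p-m] - 2 \sum_{m=0}^{99} S[(p-99)+m] \, F_i[m] + \sum_{m=0}^{99} F_i^2[m] \\
            &= \sum_{m=0}^{99} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-99] + C_i
\end{aligned}
\]

where CCorr_{S,Fi}[p-99] is the cross correlation of S and Fi with a lag of p-99.

D2(S, Fi) at each position of the batch:

\[
\begin{aligned}
p:   &\quad \sum_{m=0}^{99} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-99] + C_i \\
p+1: &\quad \sum_{m=-1}^{98} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-98] + C_i \\
p+2: &\quad \sum_{m=-2}^{97} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-97] + C_i \\
p+3: &\quad \sum_{m=-3}^{96} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-96] + C_i \\
     &\quad \cdots \\
p+7: &\quad \sum_{m=-7}^{92} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-92] + C_i
\end{aligned}
\]

- Ci: constant.
- CCorrS,Fi[d], d = p-99, …, p-92: cross correlation function of S and Fi, which can be evaluated quickly with an 8-point IFFT.
- \sum S^2[p-m]: incremental calculation.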
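The decomposition of D2(S, Fi) into a running sum of S^2, a cross correlation, and a constant can be sketched in code. This is a minimal illustration under stated assumptions, not the presented implementation: it pads both series to a power of two and runs a full-length FFT rather than the 8-point IFFT mentioned on the slide, and all function names are hypothetical.

```python
import cmath
import random

def fft(a, invert=False):
    """Recursive radix-2 FFT; len(a) must be a power of two (inverse is unscaled)."""
    n = len(a)
    if n == 1:
        return list(a)
    even, odd = fft(a[0::2], invert), fft(a[1::2], invert)
    sign = 1 if invert else -1
    out = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k], out[k + n // 2] = even[k] + w, even[k] - w
    return out

def batch_sq_distances(stream, p, pattern, batch):
    """Squared distances at positions p, p+1, ..., p+batch-1, computed together."""
    L = len(pattern)
    seg = stream[p - L + 1 : p + batch]           # S[p-L+1 .. p+batch-1]
    n = 1
    while n < len(seg):
        n *= 2
    fs = fft([complex(v) for v in seg] + [0j] * (n - len(seg)))
    ff = fft([complex(v) for v in pattern] + [0j] * (n - L))
    prod = [x * y.conjugate() for x, y in zip(fs, ff)]
    ccorr = [v.real / n for v in fft(prod, invert=True)]  # ccorr[j] = sum_m seg[j+m]*F[m]
    c_i = sum(v * v for v in pattern)             # the constant C_i
    energy = sum(v * v for v in seg[:L])          # sum of S^2 over the first window
    out = []
    for j in range(batch):
        out.append(energy - 2.0 * ccorr[j] + c_i)
        if j + 1 < batch:                         # incremental update of the S^2 sum
            energy += seg[L + j] ** 2 - seg[j] ** 2
    return out

def direct_sq_distance(stream, q, pattern):
    """Reference: squared distance at position q computed term by term."""
    L = len(pattern)
    return sum((stream[q - L + 1 + m] - pattern[m]) ** 2 for m in range(L))

random.seed(0)
stream = [random.uniform(-1, 1) for _ in range(64)]
pattern = [random.uniform(-1, 1) for _ in range(10)]
fast = batch_sq_distances(stream, 20, pattern, 8)
slow = [direct_sq_distance(stream, 20 + j, pattern) for j in range(8)]
```

In this sketch `fast` and `slow` should agree up to floating-point rounding; the batch version computes all eight cross-correlation lags with one transform pair.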


Batch Process (2)

- Advantages
  - Handles pattern series with variable lengths (weighted distance), not only patterns of the same length.
  - Reduces the processing time (O(n log2 n) vs. O(n^2)) compared with a sequential scan over n positions: the longer the FFT, the greater the gain.
- Disadvantage
  - Response time suffers: the query system must wait for several time intervals in order to form the sub-series for one batch process.


CQP algorithm: Continuous Query with Prediction

- Before the data values come: apply the batch process to the predicted time series and get the approximate distances from each pattern to the streaming time series at all predicted positions (no waiting).
- Once the data at one of the predicted positions arrives: perform the verification procedure within the portion of patterns provided by the batch process and find the actual answer among them (fast evaluation).

[Figure: CQP procedure; surviving label fragment: "= 2k )"]


CQP for the nearest neighbor search (1)

1) The batch process provides the approximate distance from each pattern to the streaming time series.
2) The prediction error can be calculated once the actual value arrives.
3) The triangular relationship always holds for each pattern:

\[
D(x, F_i) - D(x, y) \le D(y, F_i) \le D(x, F_i) + D(x, y)
\]

where y is the querying time series and x is the predicted querying time series.


CQP for the nearest neighbor search (2)

- Before the actual data comes, quick-sort the predicted distances.
- Minimum upper bound: minUp
- Candidate patterns: the lower bound is less than or equal to minUp.
- Filtered-out patterns: the lower bound is greater than minUp.

[Figure: candidate patterns and filtered-out patterns; vertical axis: distance to the querying time series]
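The filtering step can be sketched as follows, assuming the predicted distances D(x, Fi) and the prediction error D(x, y) are already available. The bounds follow the triangular relationship D(x, Fi) - D(x, y) <= D(y, Fi) <= D(x, Fi) + D(x, y); the function names and the sample numbers are hypothetical.

```python
def candidate_patterns(pred_dists, pred_error):
    """Return indices of patterns whose lower bound does not exceed minUp.

    pred_dists[i] is D(x, F_i) for the predicted series x;
    pred_error is D(x, y) for the actual querying series y.
    """
    # Sorting can happen before the actual data arrives.
    order = sorted(range(len(pred_dists)), key=lambda i: pred_dists[i])
    min_up = pred_dists[order[0]] + pred_error      # minimum upper bound
    candidates = []
    for i in order:
        if pred_dists[i] - pred_error > min_up:     # lower bound above minUp
            break                                   # all later patterns are filtered out too
        candidates.append(i)
    return candidates

def verify(candidates, true_distance):
    """Verification: compute actual distances only for the candidates."""
    return min(candidates, key=true_distance)

pred_dists = [1.0, 1.5, 4.0, 9.0]        # illustrative values only
true_dists = [1.8, 1.2, 4.5, 9.3]        # each within pred_error of its prediction
cands = candidate_patterns(pred_dists, pred_error=1.0)
nn = verify(cands, lambda i: true_dists[i])
```

With these numbers, minUp is 2.0, so only the first two patterns need their actual distances computed; the other two have lower bounds of 3.0 and 8.0.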


Performance Evaluation

- Goal: measure the performance gain against the sequential scan.
- Data set
  - The streaming time series S is generated with a random walk function.
  - 4 data sets are generated from the sub-series of S with variable lengths: 300~400, 500~600, 700~800 and 300~800.
- Use different prediction error models: SQRT, LINEAR, SQUARE.


Performance Evaluation

[Figure: the streaming time series and one of the pattern series (length = 780). Error model: linear trend.]


Result of the nearest neighbor search (linear error)

[Figure: axis label: the average CPU cost with CQP / the average CPU cost with SSCAN]

- Prediction is good, while the overhead of the batch process is large.
- Prediction is worse, while the batch process has less overhead.
- Best prediction length for each data set: a tradeoff between batch process overhead and prediction errors.
- SSCAN CPU cost: 1.5 ~ 3.3 ms


Time Series Case — II

- Scenario studied
  - The number of patterns is so large that the data have to be stored in secondary storage.
  - All pattern time series have the same length.
  - An index is built on approximations of the patterns, which can be obtained with dimensionality reduction techniques.
  - I/O operations, i.e. page accesses, are the dominant cost.
  - Data come in relatively fast.


Performance measure: Drop ratio

- What will happen if data come in too fast?
  - Some evaluations may take so long that the response time of the next one(s) will suffer.
  - The current and waiting evaluations should be dropped!
- Drop ratio
  - Given a time interval, the relative number of evaluations that are dropped.
- The limited response strategy
  - Once the response time of the current evaluation exceeds the given response threshold, this evaluation, together with all others in the waiting list, will be dropped.
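A small simulation may make the limited response strategy concrete. It is one possible reading of the slide, with several assumptions: data arrive at a fixed interval, evaluations run one at a time in arrival order, response time is measured from the arrival of the data value, and the per-evaluation costs are made-up numbers. The function name is hypothetical.

```python
def simulate(service_times, interval, threshold):
    """Return (drop ratio, average response time of the non-dropped evaluations)."""
    n = len(service_times)
    arrivals = [k * interval for k in range(n)]
    clock, dropped, responses = 0.0, 0, []
    k = 0
    while k < n:
        start = max(clock, arrivals[k])
        finish = start + service_times[k]
        deadline = arrivals[k] + threshold
        if finish <= deadline:
            responses.append(finish - arrivals[k])
            clock = finish
            k += 1
        else:
            # Threshold exceeded: abort at the deadline and drop this
            # evaluation together with everything waiting at that moment.
            clock = deadline
            j = k
            while j < n and arrivals[j] <= clock:
                j += 1
            dropped += j - k
            k = j
    avg = sum(responses) / len(responses) if responses else None
    return dropped / n, avg

# Illustrative costs in seconds; the third evaluation is the slow one.
drop_ratio, avg_response = simulate([0.2, 0.2, 3.0, 0.2, 0.2, 0.2],
                                    interval=1.0, threshold=2.5)
```

Here the slow evaluation arriving at time 2 is aborted at 4.5, taking the evaluations that arrived at times 3 and 4 with it, so three of six evaluations are dropped.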


Evaluation with traditional algorithms

- Treat each querying series independently and repeatedly use the traditional algorithms.
  - Dimensionality reduction [e.g. E. J. Keogh et al. 2001, C. Faloutsos et al. 1994]
  - Nearest neighbor queries [e.g. N. Roussopoulos et al. 1995]
  - Optimal multi-step k-nearest neighbor search [e.g. T. Seidl and H.-P. Kriegel 1998]


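Traditional algorithm: one evaluation

[Figure: feature space (index) and original space (original pattern data set). An NN search on the index with TS returns NNf; D(TS, NNf) is the threshold for a near neighbor search, which yields the candidates; the candidates are checked in the original space to decide the NN.]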


Direct algorithm

- Take the result of the previous evaluation into account.
  - The previous result may be the answer to the current evaluation.
  - Referring to the previous result has nearly no overhead.
- Direct algorithm: a direct extension of the traditional algorithm that refines the threshold with the previous result.
- The more similar two successive querying time series are, the more significant the impact of the modified threshold.
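The threshold refinement can be sketched as below. The sketch makes several assumptions that are not from the slides: a linear scan over feature vectors stands in for the index, piecewise averages stand in for the dimensionality reduction (they give a lower bound on the Euclidean distance), and the data are small random walks. All names are hypothetical.

```python
import math
import random

def paa(series, segments):
    """Piecewise averages; assumes len(series) is a multiple of segments."""
    w = len(series) // segments
    return [sum(series[k * w:(k + 1) * w]) / w for k in range(segments)]

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nn_search(query, patterns, features, segments, prev_nn=None):
    """Return (index of the NN, number of candidates fetched from the original space)."""
    w = len(query) // segments
    fq = paa(query, segments)
    # Feature-space distances lower-bound the true distances.
    fdists = [math.sqrt(w) * dist(fq, f) for f in features]
    nn_f = min(range(len(patterns)), key=fdists.__getitem__)
    threshold = dist(query, patterns[nn_f])              # traditional threshold
    if prev_nn is not None:                              # Direct: refine with previous result
        threshold = min(threshold, dist(query, patterns[prev_nn]))
    candidates = [i for i, fd in enumerate(fdists) if fd <= threshold + 1e-9]
    best = min(candidates, key=lambda i: dist(query, patterns[i]))
    return best, len(candidates)

def random_walk(n):
    v, out = 0.0, []
    for _ in range(n):
        v += random.uniform(-1, 1)
        out.append(v)
    return out

random.seed(1)
patterns = [random_walk(16) for _ in range(200)]
features = [paa(p, 4) for p in patterns]
stream = random_walk(40)
results, prev = [], None
for p in range(15, 40):
    query = stream[p - 15:p + 1]
    nn_plain, c_plain = nn_search(query, patterns, features, 4)
    nn_direct, c_direct = nn_search(query, patterns, features, 4, prev_nn=prev)
    results.append((p, nn_plain, nn_direct, c_plain, c_direct))
    prev = nn_direct
```

Because the refined threshold is never larger than the traditional one, the Direct variant fetches at most as many candidates in this sketch, while both return the same nearest neighbor.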


Pre-fetch algorithm (1)

- Two weaknesses of the Direct algorithm
  - If one evaluation finishes quickly and the waiting list is empty, the query process will wait for the new data.
  - The buffer cache loaded by the previous evaluations is managed by the operating system and is not optimized for the next evaluation.
- Solution
  - Use the idle time before the new data arrives to bring in more useful pages.
  - Pre-fetch algorithm.


Pre-fetch algorithm (2)

    If no evaluation is waiting
        1) predict the next incoming value;
        2) form the predicted querying time series and issue a NN search with this series;
    Else
        same as Direct algorithm
    End if

- Predict the next querying time series and issue a query with it.
  - Only one-step-ahead prediction is needed: the error is small.
  - The predicted time series is closer to the next querying time series than the previous querying time series is, so more useful pages are pre-fetched.
  - Only the idle time before the new data arrives is used to evaluate the query with the predicted querying series, so no overhead is incurred.
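The idle-time branch of the pseudocode can be sketched as follows. The slides do not say which predictor is used, so linear extrapolation is an assumption here, and `toy_search` is a made-up stand-in for an index search that reports which pages it touched. All names are hypothetical.

```python
def predict_next(history):
    """One-step-ahead prediction by linear extrapolation (an assumed predictor)."""
    if len(history) < 2:
        return history[-1]
    return 2 * history[-1] - history[-2]

def predicted_query(window, predicted_value):
    """Slide the fixed-length window forward one position using the prediction."""
    return window[1:] + [predicted_value]

def idle_step(waiting, window, search, cache):
    """Run while idle: search with the predicted series to pre-fetch pages."""
    if waiting:                      # evaluations are queued: behave like Direct
        return None
    q_hat = predicted_query(window, predict_next(window))
    cache.update(search(q_hat))      # pages touched by the search stay cached
    return q_hat

def toy_search(query):
    """Hypothetical search that returns the set of page ids it accessed."""
    return {int(sum(query)) % 8}

cache = set()
window = [1.0, 2.0, 3.0, 4.0]
q_hat = idle_step([], window, toy_search, cache)          # idle: pre-fetch
skipped = idle_step(["pending"], window, toy_search, cache)  # busy: no pre-fetch
```

In a real system the search on the predicted series would be cancelled as soon as the new data arrives; this sketch leaves cancellation out.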


Evaluation cycles

[Figure: flowchart of the evaluation cycles.
Direct / Traditional algorithms: search NN of S at p; if S[p+1] has come, set p := p+1 and search again; if not, keep waiting.
Pre-fetch algorithm: if S[p+1] has not come, predict S[p+1] and search NN of the predicted series Ŝ at p+1, then keep waiting.]

Note: all search procedures are subject to the Limited Response Strategy.
Note: a search procedure with a predicted series is cancelled immediately once the new data arrives.


Experimental results

- Pattern data set: 10^6 random walk time series, length = 128.
- Index: R-tree built on the first 6 DWT coefficients.
- Streaming time series: a random walk time series of length 1024; number of query positions = 897.
- Stream rate: data come in every 0.5 (0.6, 0.7, …, 4.5, 5) seconds.
- Response threshold: 0.5, 1, 1.5, …, 4.5, 5 seconds.


Experimental results: performance comparison

Response time threshold is 2.5 seconds. The smaller the value, the better the performance.

Fast stream (interval = 0.8 seconds)

Algorithm    Drop ratio    Response time
Direct       24%           0.48 seconds
Pre-fetch    19%           0.33 seconds

Slow stream (interval = 3 seconds)

Algorithm    Drop ratio    Response time
Direct       7.5%          0.55 seconds
Pre-fetch    3%            0.12 seconds


Summary

- Investigated how to adapt the traditional approaches in order to handle continuous queries efficiently.

Algorithm      Buffer sharing              Result referring            Idle time stealing
Traditional    OS, passively               No                          No
Direct         OS, passively               previous one                No
Pre-fetch      OS / pre-fetch, actively    previous / predicted one    Yes

- Future work
  - Optimal drop strategies.
  - All or some pattern series are also streaming data.


Conclusion

- Continuous queries are emerging as an important research area.
- We introduced a new strategy for evaluating continuous queries on streaming time series.
  - Batch process with the predictions and available data
  - Verification procedure once the actual value arrives
  - Introduced the drop ratio
- Performance is much better than that of “naïve” algorithms.
- Future work
  - Our strategy: batch processing and prediction
  - Apply the strategy to other continuous queries.
  - What if some patterns are also streaming time series?

Website for papers: http://ise.gmu.edu/~xywang

Backup slides


Batch Process (2): How fast is the cross correlation via FFT?

Length of FFT    Ratio* (tDIRECT / tFFT)
8                1.2
16               1.7
32               2.7
64               4.5
128              7.6
256              13.2
512              23.3
1024             41.8

*Analysis result: http://www.eptools.com/tn/T0001/PT15.HTM


Performance measure: Response time

- Response time: the average response time of the evaluations that are not dropped.
- If drops occur, the response time must be considered together with the drop ratio.

Algorithm    Drop ratio    Response time
A            20%           40 ms
B            60%           40 ms

In this case, A is clearly the winner. In more general situations, the choice of algorithm depends on the real application.


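Direct algorithm: one evaluation

[Figure: feature space (index) and original space (original pattern data set). An NN search on the index with TS returns NNf; the threshold for the near neighbor search is min{D(TS, NNf), D(TS, NN-1)}, where NN-1 is the previous result; NNf and NN-1 appear among the candidates from which the NN is decided.]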
