Continuous Queries —The Streaming Time Series Case

X. Sean Wang
Department of Information and Software Engineering
George Mason University
xywang@gmu.edu

Ph.D. Students: Like Gao, Zhengrong Yao

X. Sean Wang, GMU © 2002

Outline

Continuous queries in general
  Motivational applications
  Technical challenges
  Known projects

Streaming time series case
  Applications
  Two scenarios
  Algorithms & experimental results

Conclusions

Applications

Network management
  E.g., IP packets
  Network performance measure
  Network monitoring
  Traffic management

Business applications
  Financial market: stock tickers
  Real-time inventory monitoring

Data warehouse — (1) extract relevant data from operational data & (2) monitor aggregated data

Applications (Continued)

Web monitoring
  Click stream in web tracking
News feeds (e.g., Reuters)
  Stream of news pieces
  Subscriptions
Sensor network (upcoming!)
  Data feed from sensors (check out UC Berkeley tinyOS and related projects)
Telecommunications
  Phone calls
Location-based real-time applications(!)

Applications (Continued)

Computer infrastructure intrusion detection
  Not just network management
  Application-level monitoring

Scientific data processing
  Data on mass storage devices, and random access not supported
  Sequential scan is like looking at stream data
  A little different because query results may not need to be a stream

Applications (Continued)

In general
  Queries mostly “old” kinds, but “Monitoring” flavor
    Queries issued once and answered many times
    Potentially many queries
Technically, two issues
  Incremental evaluation & multi-query optimization
Different from “triggers” in DBMS
  Trigger system: “naïve” implementation
  Not optimized for fast data rate & number of triggers

Current Projects

NiagaraCQ: U. Wisconsin Madison
Stream project: Stanford U.
Fjord: UC Berkeley
Aurora: Brown U. & Stonebraker
TameCQ: GMU (Tracking and Monitoring Engine based on Continuous Queries)

Related Projects & Research Subjects

Xyleme: INRIA project, subscription
Stream data model: UCLA
On-line query processing
Content-based network routing
Temporal & spatial data (especially moving objects)
Materialized views (incremental updates)

Time Series Case —

Similarity-Based Continuous Queries on Time Series

Continuous queries on streaming time series

[Figure: sensors continuously send in patient data as a streaming time series S*. An ECG database contains m+1 disease signatures (patterns). Querying time series, labeled L0+1, L1+1, …, Lm+1 in the figure, are compared with the signatures to find the nearest neighbor of S at each time position.]

* Creighton University Ventricular Tachyarrhythmia Database: http://www.physionet.org/physiobank/database/cudb/

Similarity-based Queries on time series

Given a set of pattern time series and a query time series TS, find the nearest neighbor of TS (may also be k-nearest, h-near …).

Related work
  Whole match queries: LQ = LP
    R. Agrawal et al. [FODO ’93]
    D. Rafiei and A. O. Mendelzon [TKDE ’00]
  Subsequence match queries: LQ < LP
    C. Faloutsos et al. [SIGMOD ’94]: fixed / variable length query series
    T. Kahveci & A. K. Singh [ICDE ’01]: variable length query series

Our problem
  Continuous queries …
  For the querying time series: sliding windows with variable lengths based on patterns
    LQ0 = LP0
    LQ1 = LP1
    …
    LQm = LPm

Streaming time series and its neighbors

Streaming time series S is an infinite sequence of real numbers whose values arrive at the query processing system sequentially.

Nearest neighbor of S at time position p (labeled NNp in the figure).

Nearest neighbors of S: the sequence of the nearest neighbors of streaming S at all positions.

[Figure: streaming time series with windows of lengths between Lmin and Lmax and the nearest neighbor NNp.]

What’s our problem?

Challenges
  Nearest neighbor queries are computationally costly.
  Fast processing is necessary.

Assumptions of this paper
  Number of patterns is small and all patterns have already been loaded into main memory.
  Pattern set consists of time series with variable lengths.
  Weighted Euclidean distance is the similarity measure.
  Computational cost, i.e., CPU time, is the dominant factor.

Performance goal
  Fast processing.
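
As a point of reference, the sequential-scan baseline that these assumptions imply can be sketched as follows. This is only an illustration: the slides do not define the weighting, so the sketch assumes the squared distance is divided by the pattern length, and the names `weighted_distance` and `nearest_neighbor_at` are hypothetical.

```python
import math

def weighted_distance(window, pattern):
    """Euclidean distance normalized by pattern length (assumed weighting)."""
    assert len(window) == len(pattern)
    squared = sum((w - f) ** 2 for w, f in zip(window, pattern))
    return math.sqrt(squared / len(pattern))

def nearest_neighbor_at(stream, p, patterns):
    """Sequential scan: compare each pattern with the window of S ending at p."""
    best_index, best_dist = None, math.inf
    for i, pattern in enumerate(patterns):
        length = len(pattern)
        if p + 1 < length:
            continue  # not enough stream values yet for this pattern
        window = stream[p + 1 - length : p + 1]
        d = weighted_distance(window, pattern)
        if d < best_dist:
            best_index, best_dist = i, d
    return best_index, best_dist
```

Every new position repeats the full scan over all patterns, which is the cost the batch process and prediction steps are meant to reduce.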

Our solution

CQP algorithm
  Batch process
  Prediction

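One batch process

Batch Process (1)
  Find the distances from each pattern to the streaming series at multiple positions at once instead of doing so separately.

[Figure: pattern Fi (length LFi = 100) aligned with the windows of S ending at positions p, p+1, p+2, p+3, …, p+7, that is, S[p-99..p], S[p-98..p+1], S[p-97..p+2], S[p-96..p+3], …, S[p-92..p+7].]

Un-normalized square distance from S to Fi at position p, D^2(S, Fi), with LFi = 100:

\[
\begin{aligned}
D^2(S, F_i) &= \sum_{m=0}^{99} \{ S[p-(99-m)] - F_i[m] \}^2 \\
            &= \sum_{m=0}^{99} S^2[p-m] - 2 \sum_{m=0}^{99} S[p-(99-m)] \, F_i[m] + \sum_{m=0}^{99} F_i^2[m] \\
            &= \sum_{m=0}^{99} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-99] + C_i
\end{aligned}
\]

CCorr_{S,F_i}[p-99]: cross correlation of S and Fi with lag of p-99.

The same form holds at each position of the batch:

\[
\begin{aligned}
p:   \quad & D^2(S, F_i) = \sum_{m=0}^{99} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-99] + C_i \\
p+1: \quad & D^2(S, F_i) = \sum_{m=-1}^{98} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-98] + C_i \\
p+2: \quad & D^2(S, F_i) = \sum_{m=-2}^{97} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-97] + C_i \\
p+3: \quad & D^2(S, F_i) = \sum_{m=-3}^{96} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-96] + C_i \\
\dots \\
p+7: \quad & D^2(S, F_i) = \sum_{m=-7}^{92} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-92] + C_i
\end{aligned}
\]

C_i: constant.
CCorr_{S,F_i}[d], d = p-99, …, p-92: cross correlation function of S and Fi, which can be evaluated quickly with an 8-point IFFT.
\sum S^2[p-m]: incremental calculation.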
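
The decomposition above can be sketched in code. This is a minimal illustration rather than the authors' implementation: it uses a small pure-Python radix-2 FFT, zero-pads to a power of two (an assumption made here to avoid wrap-around), and the function names are hypothetical.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def cross_correlation(segment, pattern):
    """ccorr[d] = sum_m segment[d + m] * pattern[m] for every valid lag d."""
    lags = len(segment) - len(pattern) + 1
    size = 1
    while size < len(segment):
        size *= 2
    a = fft(list(segment) + [0.0] * (size - len(segment)))
    b = fft(list(pattern) + [0.0] * (size - len(pattern)))
    product = [x * y.conjugate() for x, y in zip(a, b)]
    corr = fft(product, inverse=True)
    return [corr[d].real / size for d in range(lags)]

def batch_squared_distances(segment, pattern):
    """Squared distance from the pattern to every window of the segment."""
    length = len(pattern)
    c_i = sum(f * f for f in pattern)              # the constant C_i
    ccorr = cross_correlation(segment, pattern)    # one FFT-based pass
    energy = sum(s * s for s in segment[:length])  # sum of S^2 over first window
    result = []
    for d, cc in enumerate(ccorr):
        result.append(energy - 2.0 * cc + c_i)
        if d + length < len(segment):
            # incremental update of the S^2 sum when the window slides by one
            energy += segment[d + length] ** 2 - segment[d] ** 2
    return result
```

Each entry of the result corresponds to one position in the batch; only the cross-correlation term requires the FFT, while the other two terms are a running sum and a per-pattern constant.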

Batch Process (2)

Advantages
  Handles pattern series with variable lengths (weighted distance), not only patterns of the same length.
  Reduces the processing time for n positions from O(n^2) with sequential scan to O(n log2 n): the longer the FFT, the greater the gain.

Disadvantage
  Response time suffers: the query system must wait for some time intervals in order to form the sub-series for one batch process.

Continuous Query with Prediction (CQP algorithm)

Before the data values arrive:
  Apply the batch process to the predicted time series and get the approximate distances from each pattern to the streaming time series at all predicted positions (no waiting).

Once the data at one of the predicted positions arrives:
  Perform the verification procedure on the portion of patterns provided by the batch process and find the actual answer among them (fast evaluation).

CQP procedure

[Figure: diagram of the CQP procedure.]

CQP for the nearest neighbor search (1)

1) Batch process provides the approximated distance from each pattern to the streaming time series.
2) The prediction error can be calculated once the actual value arrives.
3) The triangular relationship always holds for each pattern:

\[ D(x, F_i) - D(x, y) \le D(y, F_i) \le D(x, F_i) + D(x, y) \]

x: predicted querying time series
y: querying time series

CQP for the nearest neighbor search (2)

Before the actual data comes, quick-sort the predicted distances.

Minimum upper bound: minUp
  Candidate patterns: the lower bound is less than or equal to minUp.
  Filtered-out patterns: the lower bound is greater than minUp.

[Figure: patterns ordered by distance to the querying time series (axis label), split at minUp into candidate patterns and filtered-out patterns.]
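
The filtering step can be sketched as below. The sketch assumes that the predicted distances D(x, Fi) come from the batch process and that the prediction error D(x, y) is known once the actual value arrives; the function names are hypothetical, and clamping the lower bound at zero is an addition made here because distances are non-negative.

```python
def candidate_patterns(predicted_distances, prediction_error):
    """Keep the patterns whose lower bound does not exceed minUp."""
    # Triangle inequality: D(x,Fi) - D(x,y) <= D(y,Fi) <= D(x,Fi) + D(x,y)
    lower = [max(d - prediction_error, 0.0) for d in predicted_distances]
    upper = [d + prediction_error for d in predicted_distances]
    min_up = min(upper)  # the true nearest distance cannot exceed this
    candidates = [i for i, lo in enumerate(lower) if lo <= min_up]
    return candidates, min_up

def verify(candidates, actual_distance):
    """Verification: compute actual distances only for the candidates."""
    return min(candidates, key=actual_distance)
```

For example, with predicted distances `[1.0, 1.5, 4.0, 9.0]` and a prediction error of `1.0`, minUp is 2.0 and only the first two patterns need to be verified against the actual data.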

Performance Evaluation

Goal: measure performance gain against the sequential scan.

Data set
  Streaming time series S is generated with a random walk function.
  4 data sets are generated from the sub-series of S with variable lengths: 300~400, 500~600, 700~800 and 300~800.

Use different prediction error models.
  SQRT
  LINEAR
  SQUARE

Performance Evaluation

[Figure: the streaming time series and one of the pattern series (length = 780). Error model: linear trend.]

Result of the nearest neighbor search

[Figure, linear error model: the average CPU cost with CQP compared with the average CPU cost with SSCAN. SSCAN CPU cost: 1.5 ~ 3.3 ms.]

Figure annotations:
  Prediction is good, while the overhead of batch process is large.
  Prediction is worse, while batch process has less overhead.
  Best prediction length for each dataset: tradeoff of batch process and prediction errors.

Time Series Case — II

Scenario studied
  Number of patterns is so large that the data have to be stored in secondary storage
  All pattern time series have the same length
  An index is built on the approximations of the patterns, which could be obtained with dimensionality reduction techniques
  I/O operation, i.e., page access, is the dominant cost
  Data come in relatively fast

Performance measure: Drop ratio

What will happen if data come in too fast?
  Some evaluations may take so long that the response time of the next one(s) will suffer.
  The current and waiting evaluations should be dropped!

Drop ratio
  Given a time interval, the relative number of evaluations that are dropped.

The limited response strategy
  Once the response time of the current evaluation exceeds the given response threshold, this evaluation, together with all others in the waiting list, will be dropped.
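
A small simulation may make the strategy concrete. It is a sketch under assumptions not stated in the slides: values arrive at a fixed interval, one evaluation runs at a time in arrival order, an evaluation that exceeds the threshold is abandoned at its deadline, and an evaluation that is already past its deadline when it reaches the front is dropped at once. The function name and the service-time list are hypothetical.

```python
def simulate_limited_response(arrival_interval, service_times, threshold):
    """Return (drop_ratio, mean_response_time) for a non-empty list of evaluations."""
    total = len(service_times)
    dropped = 0
    responses = []
    free_at = 0.0  # time at which the query processor becomes idle
    k = 0
    while k < total:
        arrival = k * arrival_interval
        start = max(free_at, arrival)
        finish = start + service_times[k]
        deadline = arrival + threshold
        if finish <= deadline:
            responses.append(finish - arrival)
            free_at = finish
            k += 1
            continue
        # Threshold exceeded: stop at the deadline (or immediately if it has
        # already passed) and drop this evaluation plus everything waiting.
        free_at = max(start, deadline)
        last = k
        while last < total and last * arrival_interval <= free_at:
            last += 1
        dropped += last - k
        k = last
    mean_response = sum(responses) / len(responses) if responses else None
    return dropped / total, mean_response
```

With an interval of 1 second, a threshold of 2 seconds and service times `[0.5, 3.0, 0.5, 0.5, 0.5]`, the slow second evaluation is dropped together with the two that arrive while it is running, so three of the five evaluations are lost.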

Evaluation with traditional algorithms

Treat each querying series independently and repeatedly use the traditional algorithms.
  Dimensionality reduction [e.g., E. J. Keogh et al. 2001, C. Faloutsos et al. 1994]
  Nearest neighbor queries [e.g., N. Roussopoulos et al. 1995]
  Optimal multi-step k-nearest neighbor search [e.g., T. Seidl and H.-P. Kriegel 1998]

Traditional algorithm: one evaluation

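[Figure: one evaluation of the traditional algorithm. In the feature space, an NN search on the index for TS returns NNf. The distance D(TS, NNf), computed in the original space, is then used as the threshold of a near neighbor search on the index, which returns the candidates. The candidates are checked against the original pattern data set to find NN.]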
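
The multi-step evaluation can be sketched as follows. This is an illustration only: it uses piecewise aggregate approximation (PAA) as the dimensionality reduction, a linear scan in place of the index, and assumes that all series have the same length and that the length is divisible by the number of segments. All function names are hypothetical.

```python
import math

def paa(series, segments):
    """Piecewise aggregate approximation: mean of each equal-length segment."""
    width = len(series) // segments
    return [sum(series[j * width:(j + 1) * width]) / width for j in range(segments)]

def dist(a, b):
    """Euclidean distance in the original space."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def feature_dist(fa, fb, width):
    """Distance in the feature space; never larger than dist() on the originals."""
    return math.sqrt(width) * dist(fa, fb)

def multi_step_nn(query, patterns, segments=4):
    width = len(query) // segments
    fq = paa(query, segments)
    fdists = [feature_dist(fq, paa(p, segments), width) for p in patterns]
    # Step 1: nearest neighbor in the feature space (NNf).
    nnf = min(range(len(patterns)), key=fdists.__getitem__)
    # Step 2: its distance in the original space becomes the threshold.
    threshold = dist(query, patterns[nnf])
    # Step 3: near neighbor search in the feature space with that threshold.
    candidates = [i for i, fd in enumerate(fdists) if fd <= threshold]
    # Step 4: refine the candidates in the original space.
    best = min(candidates, key=lambda i: dist(query, patterns[i]))
    return best, candidates
```

Because the feature distance never exceeds the original distance, the true nearest neighbor always survives step 3, so the refinement returns the same answer as a full scan while touching fewer original series.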

Direct algorithm

Take the result of the previous evaluation into account.
  The previous result may be the answer to the current evaluation.
  Referring to the previous result has nearly no overhead.

Direct algorithm: a direct extension of the traditional algorithm that refines the threshold with the previous result.

The more similar the two successive querying time series are, the more significant the impact of the modified threshold.
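
The threshold refinement can be sketched as below. The sketch assumes that a lower bound of the true distance is available for every pattern (for instance from an index on reduced features) and that true distances are computed on demand; the function name and parameters are hypothetical.

```python
def direct_evaluation(lower_bounds, true_distance, previous_nn=None):
    """One evaluation; lower_bounds[i] <= true_distance(i) must hold for all i."""
    # Nearest neighbor in the feature space and its true distance.
    nnf = min(range(len(lower_bounds)), key=lower_bounds.__getitem__)
    best, threshold = nnf, true_distance(nnf)
    # Refinement: the previous answer may give a tighter threshold.
    if previous_nn is not None:
        d_prev = true_distance(previous_nn)
        if d_prev < threshold:
            best, threshold = previous_nn, d_prev
    # Only patterns whose lower bound is within the threshold are candidates.
    candidates = [i for i, lb in enumerate(lower_bounds) if lb <= threshold]
    for i in candidates:
        d = true_distance(i)
        if d < threshold:
            best, threshold = i, d
    return best, len(candidates)
```

When the previous nearest neighbor is still close to the current querying series, the tighter threshold shrinks the candidate set, which in the disk-based scenario translates into fewer page accesses.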

Pre-fetch algorithm (1)

Two weaknesses of the Direct algorithm
  If one evaluation is finished quickly and the waiting list is empty, the query process will wait for the new data.
  The cached buffer loaded by the previous evaluations is managed by the operating system and is not optimized for the next evaluation.

Solution
  Use the idle time before the new data arrives to bring in more useful pages: the Pre-fetch algorithm.

Pre-fetch algorithm (2)

Predict the next querying time series and issue a query with it.
  Only one-step-ahead prediction is needed: the error is small.
  The predicted time series is closer to the next querying time series than the previous querying time series is, so more useful pages are pre-fetched.
  Only the idle time before the new data arrives is used to evaluate the query with the predicted querying series, so no overhead is incurred.

If no evaluation is waiting
  1) predict the next incoming value;
  2) form the predicted querying time series and issue a NN search with this series;
Else
  same as Direct algorithm
End if
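
Forming the predicted querying series can be sketched as follows. The slides do not name a prediction model, so linear extrapolation from the last two values is used here purely as a stand-in, and both function names are hypothetical.

```python
def predict_next(values):
    """One-step-ahead prediction by linear extrapolation (an assumed model)."""
    if len(values) < 2:
        return values[-1]
    return 2 * values[-1] - values[-2]

def predicted_query(stream, length):
    """Window of `length` values ending at the predicted position p+1."""
    recent = list(stream[-(length - 1):]) if length > 1 else []
    return recent + [predict_next(stream)]
```

Only the last value of the window is predicted; the other values are actual data, which is why a one-step-ahead prediction keeps the predicted series close to the next real querying series.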

Evaluation cycles

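[Figure: flowchart of the evaluation cycles.
Direct / Traditional algorithms: search NN of S at p; if S[p+1] has come, set p := p+1 and search again; otherwise wait.
Pre-fetch algorithm: search NN of S at p; if S[p+1] has come, set p := p+1 and search again; otherwise predict S[p+1], search NN of the predicted series Ŝ at p+1, and then wait.]

Note: all search procedures are subject to the Limited Response Strategy.
Note: the search procedure with the predicted series is cancelled immediately once the new data arrives.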

Experimental results

Pattern dataset: 10^6 random walk time series, length = 128.

Index: R-tree built on the first 6 DWT coefficients.

Streaming time series: random walk time series with length of 1024, No. of query positions = 897.

Stream rate: data come in every 0.5 (0.6, 0.7,…, 4.5, 5) seconds.

Response threshold: 0.5, 1, 1.5 ,…,4.5, 5 seconds.

Experimental results: performance comparison

Response time threshold is 2.5 seconds. The smaller the value, the better the performance.

Fast stream (interval = 0.8 seconds)
Algorithm   Drop ratio   Response time
Direct      24%          0.48 seconds
Pre-fetch   19%          0.33 seconds

Slow stream (interval = 3 seconds)
Algorithm   Drop ratio   Response time
Direct      7.5%         0.55 seconds
Pre-fetch   3%           0.12 seconds

Summary
  Investigated how to adapt the traditional approaches in order to efficiently handle continuous queries.

Algorithm     Buffer sharing             Result referring           Idle time stealing
Traditional   OS, passively              No                         No
Direct        OS, passively              previous one               No
Pre-fetch     OS / pre-fetch, actively   previous / predicted one   Yes

Future work
  Optimal drop strategies.
  All or some pattern series are also streaming data.

Conclusion
  Continuous queries are emerging as an important research area.
  We introduced a new strategy for evaluating continuous queries on streaming time series.
    Batch process with the predictions and available data
    Verification procedure once the actual value arrives
    Introduced drop ratio
  Performance much better than “naïve” algorithms.

Future work
  Our strategy: batch processing and prediction
  Apply the strategy to other continuous queries.
  What if some patterns are also streaming time series?

Website for papers: http://ise.gmu.edu/~xywang

Backup slides

Batch Process (2)

How fast is the cross correlation via FFT?

Length of FFT   Ratio* (tDIRECT / tFFT)
8               1.2
16              1.7
32              2.7
64              4.5
128             7.6
256             13.2
512             23.3
1024            41.8

*Analysis result: http://www.eptools.com/tn/T0001/PT15.HTM

Performance measure: Response time

Response time: the averaged response time for the evaluations that are not dropped.

If drop occurs, the response time must be associated with the drop ratio.

Algorithm   Drop ratio   Response time
A           20%          40ms
B           60%          40ms

In this case, A is clearly the winner. In more general situations, how to choose the algorithm depends on the real applications.

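Direct algorithm: one evaluation

[Figure: same procedure as one evaluation of the traditional algorithm, except that the near neighbor search uses the threshold min{D(TS, NNf), D(TS, NN-1)}, where NNf is the nearest neighbor found in the feature space and NN-1 is the result of the previous evaluation. The candidates are checked against the original pattern data set to find NN.]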