Continuous Queries —The Streaming Time Series Case
X. Sean Wang
Department of Information and Software Engineering
George Mason University
[email protected]
Ph.D. Students: Like Gao, Zhengrong Yao
X. Sean Wang, GMU © 2002
Outline
- Continuous queries in general
  - Motivational applications
  - Technical challenges
  - Known projects
- Streaming time series case
  - Applications
  - Two scenarios
- Algorithms & experimental results
- Conclusions
Applications
- Network management
  - E.g., IP packets
  - Network performance measure
  - Network monitoring
  - Traffic management
- Business applications
  - Financial market: stock tickers
  - Real-time inventory monitoring
- Data warehouse: (1) extract relevant data from operational data and (2) monitor aggregated data
Applications (Continued)
- Web monitoring
  - Click stream in web tracking
- News feeds (e.g., Reuters)
  - Stream of news pieces
  - Subscriptions
- Sensor network (upcoming!)
  - Data feed from sensors (check out UC Berkeley TinyOS and related projects)
- Telecommunications
  - Phone calls
  - Location-based real-time applications(!)
Applications (Continued)
- Computer infrastructure intrusion detection
  - Not just network management
  - Application-level monitoring
- Scientific data processing
  - Data sit on mass storage devices, and random access is not supported
  - A sequential scan is like looking at stream data
  - A little different because query results may not need to be a stream
Applications (Continued)
- In general
  - Queries are mostly the “old” kinds, but with a “monitoring” flavor
  - Queries are issued once and answered many times
  - Potentially many queries
- Technically, two issues
  - Incremental evaluation & multi-query optimization
- Different from “triggers” in a DBMS
  - Trigger system: “naïve” implementation
  - Not optimized for fast data rates or a large number of triggers
Current Projects
- NiagaraCQ: U. Wisconsin Madison
- Stream project: Stanford U.
- Fjord: UC Berkeley
- Aurora: Brown U. & Stonebraker
- TameCQ: GMU (Tracking and Monitoring Engine based on Continuous Queries)
Related Projects & Research Subjects
- Xyleme: INRIA project, subscription
- Stream data model: UCLA
- On-line query processing
- Content-based network routing
- Temporal & spatial data (especially moving objects)
- Materialized views (incremental updates)
Time Series Case —
Similarity-Based Continuous Queries on Time Series
Continuous queries on streaming time series
[Figure: sensors continuously send in patient data, forming the streaming time series S*. An ECG database contains m+1 disease signatures (patterns). The querying time series are windows of S with lengths labeled L0+1, L1+1, ..., Lm+1. The query returns the nearest neighbor of S at each time position.]
* Creighton University Ventricular Tachyarrhythmia Database: http://www.physionet.org/physiobank/database/cudb/
Similarity-based queries on time series
- Given a set of pattern time series and a query time series TS, find the nearest neighbor of TS (may also be k-nearest, h-near, ...).
- Related work
  - Whole match queries: LQ = LP
    - R. Agrawal et al. [FODO '93]
    - D. Rafiei and A. O. Mendelzon [TKDE '00]
  - Subsequence match queries: LQ < LP
    - C. Faloutsos et al. [SIGMOD '94]: fixed / variable length query series
    - T. Kahveci & A. K. Singh [ICDE '01]: variable length query series
- Our problem
  - Continuous queries ...
  - For the querying time series: sliding windows with variable lengths based on the patterns: LQ0 = LP0, LQ1 = LP1, ..., LQm = LPm
Streaming time series and its neighbors
- Streaming time series S: an infinite sequence of real numbers whose values arrive at the query processing system sequentially.
- Nearest neighbor of S at time position p (labeled NNp in the figure).
- Nearest neighbors of S: the sequence of the nearest neighbors of the streaming S at all positions.
[Figure: windows of S ending at position p, with lengths between Lmin and Lmax; NNp marks the nearest neighbor at p.]
What’s our problem?
- Challenges
  - Nearest neighbor queries are computationally costly.
  - Fast processing is necessary.
- Assumptions of this paper
  - The number of patterns is small, and all patterns have already been loaded into main memory.
  - The pattern set consists of time series with variable lengths.
  - Weighted Euclidean distance is the similarity measure.
  - Computational cost, i.e., CPU time, is the dominant factor.
- Performance goal
  - Fast processing.
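To make the setting concrete, here is a minimal Python sketch of the sequential-scan baseline: at position p, every in-memory pattern is compared with the window of S that has the pattern's own length and ends at p. The function name `nearest_neighbor_at` and the per-length weighting are assumptions for illustration only; the slides name a weighted Euclidean distance but do not give the weights.

```python
import numpy as np

def nearest_neighbor_at(stream, p, patterns):
    """Return (index, distance) of the pattern closest to the window of
    `stream` ending at position p. Each pattern is compared with a window
    of its own length; patterns longer than p + 1 values are skipped."""
    best_idx, best_dist = None, np.inf
    for i, f in enumerate(patterns):
        n = len(f)
        if n > p + 1:
            continue  # not enough stream values yet for this pattern
        window = stream[p - n + 1 : p + 1]
        # Assumed weighting: divide by the pattern length so that long and
        # short patterns are comparable. This choice is hypothetical.
        dist = np.sqrt(np.sum((window - f) ** 2) / n)
        if dist < best_dist:
            best_idx, best_dist = i, dist
    return best_idx, best_dist

stream = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0])
patterns = [np.array([3.0, 2.0, 1.0]), np.array([0.0, 0.0])]
idx, dist = nearest_neighbor_at(stream, 5, patterns)
```

Calling this at every position recomputes every distance from scratch, which is the cost the batch process and prediction are meant to reduce.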
Our solution
- CQP algorithm
  - Batch process
  - Prediction
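One batch process

Batch Process (1)
- Find the distances from each pattern to the streaming series at multiple positions at once, instead of computing them separately.

[Figure: pattern Fi (length LFi = 100) aligned against S over the windows p-99..p, p-98..p+1, p-97..p+2, p-96..p+3, ..., p-92..p+7.]

Un-normalized squared distance from S to Fi at position p, for LFi = 100:

\[
\begin{aligned}
D^2(S, F_i) &= \sum_{m=0}^{99} \{ S[p-99+m] - F_i[m] \}^2 \\
            &= \sum_{m=0}^{99} S^2[p-99+m] - 2 \sum_{m=0}^{99} S[p-99+m]\, F_i[m] + \sum_{m=0}^{99} F_i^2[m] \\
            &= \sum_{m=0}^{99} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-99] + C_i
\end{aligned}
\]

Here CCorr_{S,Fi}[p-99] is the cross correlation of S and Fi with a lag of p-99.

The same form holds at the following positions:

\[
\begin{aligned}
p:   \quad & D^2(S, F_i) = \sum_{m=0}^{99} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-99] + C_i \\
p+1: \quad & D^2(S, F_i) = \sum_{m=-1}^{98} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-98] + C_i \\
p+2: \quad & D^2(S, F_i) = \sum_{m=-2}^{97} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-97] + C_i \\
p+3: \quad & D^2(S, F_i) = \sum_{m=-3}^{96} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-96] + C_i \\
     & \cdots \\
p+7: \quad & D^2(S, F_i) = \sum_{m=-7}^{92} S^2[p-m] - 2 \cdot CCorr_{S,F_i}[p-92] + C_i
\end{aligned}
\]

- C_i: constant.
- CCorr_{S,Fi}[d], d = p-99, ..., p-92: cross correlation function of S and Fi, which can be evaluated quickly with an 8-point IFFT.
- \sum S^2[p-m]: incremental calculation.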
Batch Process (2)
- Advantages
  - Handles pattern series with variable lengths (weighted distance), not only patterns of the same length.
  - Reduces the processing time for n positions from O(n^2) with sequential scan to O(n log_2 n): the longer the FFT, the larger the gain.
- Disadvantage
  - Response time suffers: the query system must wait some time intervals in order to form the sub-series for one batch process.
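The batch idea can be sketched in Python with NumPy: the squared distance at each position splits into a window energy, a cross-correlation term, and a per-pattern constant, and the cross-correlation values for a whole batch of positions come out of one FFT round trip. The function name `batch_sq_distances`, the power-of-two padding, and the cumulative-sum energy update are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def batch_sq_distances(s, f, p, batch):
    """Squared distances from pattern f to the windows of s that end at
    positions p, p+1, ..., p+batch-1 (each window has len(f) values).
    Requires p - len(f) + 1 >= 0 and p + batch <= len(s)."""
    n = len(f)
    seg = np.asarray(s[p - n + 1 : p + batch], dtype=float)  # n + batch - 1 values
    size = 1
    while size < len(seg):  # power-of-two FFT length, long enough to avoid wrap-around
        size *= 2
    # Cross correlation ccorr[d] = sum_m seg[d + m] * f[m], for d = 0..batch-1
    spec = np.fft.rfft(seg, size) * np.conj(np.fft.rfft(f, size))
    ccorr = np.fft.irfft(spec, size)[:batch]
    # Window energies sum_m seg[d + m]^2, all taken from one cumulative sum
    cs = np.concatenate(([0.0], np.cumsum(seg ** 2)))
    energy = cs[n : n + batch] - cs[:batch]
    c_i = np.sum(np.asarray(f, dtype=float) ** 2)  # the constant term C_i
    return energy - 2.0 * ccorr + c_i

rng = np.random.default_rng(0)
s = np.cumsum(rng.standard_normal(400))   # random walk stream
f = s[100:200].copy()                     # a pattern of length 100
d2 = batch_sq_distances(s, f, p=199, batch=8)
```

With `batch=8` the returned array covers positions p through p+7, matching the eight lags on the slide; the first entry is numerically zero here because the pattern was cut from the stream at exactly that window.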
Continuous Query with Prediction
CQP algorithm
- Before the data values come: apply the batch process to the predicted time series and get the approximate distances from each pattern to the streaming time series at all predicted positions (no waiting).
- Once the data at one of the predicted positions arrives: perform the verification procedure on the portion of patterns provided by the batch process and find the actual answer among them (fast evaluation).
CQP procedure
[Figure: CQP procedure diagram. Only the repeated label "= 2k )" survives from the original.]
CQP for the nearest neighbor search (1)
1) The batch process provides the approximate distance from each pattern to the streaming time series.
2) The prediction error can be calculated once the actual value arrives.
3) The triangular relationship always holds for each pattern:

\[
D(x, F_i) - D(x, y) \le D(y, F_i) \le D(x, F_i) + D(x, y)
\]

y: querying time series
x: predicted querying time series
CQP for the nearest neighbor search (2)
- Before the actual data comes, quick sort the predicted distances.
- Minimum upper bound: minUp
- Candidate patterns: the lower bound is less than or equal to minUp
- Filtered-out patterns: the lower bound is greater than minUp
[Figure: vertical axis: distance to the querying time series.]
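The filtering step can be sketched as follows, assuming the predicted distances D(x, Fi) and the prediction error D(x, y) are already available. The triangular relationship gives each pattern a lower and an upper bound on its actual distance, and only patterns whose lower bound does not exceed minUp need verification. The function names and the toy data are hypothetical.

```python
import numpy as np

def candidate_patterns(pred_dists, pred_error):
    """Return (candidate indices, minUp) from predicted distances D(x, F_i)
    and the prediction error D(x, y)."""
    pred_dists = np.asarray(pred_dists, dtype=float)
    lower = np.maximum(pred_dists - pred_error, 0.0)  # D(x,F_i) - D(x,y) <= D(y,F_i)
    upper = pred_dists + pred_error                   # D(y,F_i) <= D(x,F_i) + D(x,y)
    min_up = upper.min()
    return np.flatnonzero(lower <= min_up), min_up

def verify(candidates, patterns, y):
    """Compute actual distances only for the candidates and return the winner."""
    return min(candidates, key=lambda i: np.linalg.norm(patterns[i] - y))

patterns = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0], [5.0, 5.0, 5.0], [9.0, 9.0, 9.0]])
x = np.array([1.0, 1.0, 2.0])   # predicted querying series
y = np.array([1.0, 1.0, 1.5])   # actual querying series, known once the value arrives
pred_dists = np.linalg.norm(patterns - x, axis=1)
cands, min_up = candidate_patterns(pred_dists, np.linalg.norm(x - y))
nn = verify(cands, patterns, y)
```

The smaller the prediction error, the tighter the bounds and the fewer patterns survive the filter.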
Performance Evaluation
- Goal: measure the performance gain against the sequential scan.
- Data set
  - Streaming time series S is generated with a random walk function.
  - 4 data sets are generated from sub-series of S with variable lengths: 300~400, 500~600, 700~800 and 300~800.
- Use different prediction error models: SQRT, LINEAR, SQUARE
Performance Evaluation
[Figure: the streaming time series and one of the pattern series (length = 780). Error model: linear trend.]
Result of the nearest neighbor search (linear error)
[Figure: the average CPU cost with CQP compared with the average CPU cost with SSCAN. SSCAN CPU cost: 1.5 ~ 3.3 ms.]
- Prediction is good, while the overhead of the batch process is large.
- Prediction is worse, while the batch process has less overhead.
- Best prediction length for each dataset: a tradeoff between batch process overhead and prediction errors.
Time Series Case — II
- Scenario studied
  - The number of patterns is so large that the data have to be stored in secondary storage
  - All pattern time series have the same length
  - An index is built on approximations of the patterns, which could be obtained with dimensionality reduction techniques
  - I/O operation, i.e., page access, is the dominant cost
  - Data come in relatively fast
Performance measure: Drop ratio
- What will happen if data come in too fast?
  - Some evaluations may take so long that the response time of the next one(s) will suffer.
  - The current and waiting evaluations should be dropped!
- Drop ratio
  - Given a time interval, the relative number of evaluations that are dropped.
- The limited response strategy
  - Once the response time of the current evaluation exceeds the given response threshold, this evaluation, together with all others in the waiting list, will be dropped.
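The limited response strategy and the drop ratio can be illustrated with a small single-server simulation in Python. It assumes that evaluation k arrives at time k * interval, that response time is measured from arrival to completion, and that the waiting list holds every evaluation that has arrived by the moment the current one exceeds the threshold. These modelling choices and the function name are assumptions made for this sketch.

```python
def simulate_limited_response(service_times, interval, threshold):
    """Return (drop_ratio, mean_response) for evaluations that arrive every
    `interval` seconds and need service_times[k] seconds of processing."""
    n = len(service_times)
    free_at = 0.0          # time at which the query processor becomes idle
    dropped = 0
    responses = []
    k = 0
    while k < n:
        arrival = k * interval
        start = max(free_at, arrival)
        finish = start + service_times[k]
        deadline = arrival + threshold
        if finish <= deadline:
            responses.append(finish - arrival)
            free_at = finish
            k += 1
        else:
            # Threshold exceeded: drop this evaluation and everything that
            # has arrived (and is waiting) by the time of the drop.
            last = k
            while last + 1 < n and (last + 1) * interval <= deadline:
                last += 1
            dropped += last - k + 1
            free_at = deadline
            k = last + 1
    mean_response = sum(responses) / len(responses) if responses else None
    return dropped / n, mean_response

ratio, mean_resp = simulate_limited_response(
    [0.2, 3.0, 0.2, 0.2, 0.2], interval=1.0, threshold=1.5)
```

In this toy run the slow second evaluation is cut off at its threshold and takes the one waiting behind it along, so two of five evaluations are dropped while the rest keep a short response time.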
Evaluation with traditional algorithms
- Treat each querying series independently and repeatedly use the traditional algorithms.
  - Dimensionality reduction [e.g. E. J. Keogh et al. 2001, C. Faloutsos et al. 1994]
  - Nearest neighbor queries [e.g. N. Roussopoulos et al. 1995]
  - Optimal multi-step k-nearest neighbor search [e.g. T. Seidl and H.-P. Kriegel 1998]
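Traditional algorithm: one evaluation
[Figure: one evaluation over the original pattern data set and its index. In feature space, an NN search for TS finds NNf. The distance D(TS, NNf), computed in the original space, is the threshold of a near neighbor search in feature space, which yields the candidates. The candidates are then checked in the original space to find NN.]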
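A minimal Python sketch of one such filter-and-refine evaluation follows. It swaps in two simplifications that are not from the slides: piecewise aggregate approximation (PAA) as the dimensionality reduction, chosen because its scaled distance is a known lower bound of the Euclidean distance, and a linear scan over the features in place of a disk-based index. All names are illustrative.

```python
import numpy as np

def paa(x, k):
    """Piecewise aggregate approximation along the last axis: the means of
    k equal-length segments (the length must be divisible by k)."""
    return x.reshape(x.shape[:-1] + (k, -1)).mean(axis=-1)

def multistep_nn(query, patterns, k=8):
    """Return (index of the nearest pattern, number of candidates refined)."""
    n = patterns.shape[1]
    # Feature-space distances; sqrt(n / k) scaling makes them lower bounds
    feat_dist = np.sqrt(n / k) * np.linalg.norm(paa(patterns, k) - paa(query, k), axis=1)
    nn_f = int(np.argmin(feat_dist))                      # NN in feature space
    threshold = np.linalg.norm(patterns[nn_f] - query)    # D(TS, NNf), original space
    # Near neighbor (range) search in feature space; tiny slack for rounding
    candidates = np.flatnonzero(feat_dist <= threshold + 1e-9)
    true_dist = np.linalg.norm(patterns[candidates] - query, axis=1)
    return int(candidates[np.argmin(true_dist)]), len(candidates)

rng = np.random.default_rng(1)
patterns = np.cumsum(rng.standard_normal((500, 128)), axis=1)  # random walk patterns
query = np.cumsum(rng.standard_normal(128))
best, n_cand = multistep_nn(query, patterns, k=8)
```

Because the feature distance never overestimates the true distance, the true nearest neighbor always survives the range search, and only `n_cand` patterns need a full-length distance computation.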
Direct algorithm
- Take the result of the previous evaluation into account.
  - The previous result may be the answer to the current evaluation.
  - Referring to the previous result has nearly no overhead.
- Direct algorithm: a direct extension of the traditional algorithm that refines the threshold with the previous result.
- The more similar two successive querying time series are, the more significant the impact of the modified threshold.
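The threshold refinement can be sketched by extending a filter-and-refine search with one extra argument for the previous answer. As before, PAA features and a linear scan stand in for the real dimensionality reduction and index; these substitutions and all names are assumptions made for illustration.

```python
import numpy as np

def paa(x, k):
    """Means of k equal-length segments along the last axis."""
    return x.reshape(x.shape[:-1] + (k, -1)).mean(axis=-1)

def direct_nn(query, patterns, feats, prev_nn=None):
    """One evaluation; the threshold is min{D(TS, NNf), D(TS, previous NN)}."""
    n, k = patterns.shape[1], feats.shape[1]
    feat_dist = np.sqrt(n / k) * np.linalg.norm(feats - paa(query, k), axis=1)
    nn_f = int(np.argmin(feat_dist))
    threshold = np.linalg.norm(patterns[nn_f] - query)
    if prev_nn is not None:
        # Refine the threshold with the previous evaluation's answer
        threshold = min(threshold, np.linalg.norm(patterns[prev_nn] - query))
    candidates = np.flatnonzero(feat_dist <= threshold + 1e-9)
    true_dist = np.linalg.norm(patterns[candidates] - query, axis=1)
    return int(candidates[np.argmin(true_dist)]), len(candidates)

def run_stream(stream, patterns, k=8):
    """Evaluate every sliding window, passing each answer on to the next."""
    n = patterns.shape[1]
    feats = paa(patterns, k)
    prev, answers = None, []
    for p in range(n - 1, len(stream)):
        prev, _ = direct_nn(stream[p - n + 1 : p + 1], patterns, feats, prev)
        answers.append(prev)
    return answers

rng = np.random.default_rng(2)
patterns = np.cumsum(rng.standard_normal((300, 64)), axis=1)
stream = np.cumsum(rng.standard_normal(100))
answers = run_stream(stream, patterns)
```

The refined threshold can only shrink the candidate set, and it never excludes the true nearest neighbor, since the previous answer's actual distance is itself an upper bound on the nearest distance.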
Pre-fetch algorithm (1)
- Two weaknesses of the Direct algorithm
  - If one evaluation finishes quickly and the waiting list is empty, the query process will wait for the new data.
  - The cache buffer loaded by previous evaluations is managed by the operating system and is not optimized for the next evaluation.
- Solution
  - Use the idle time before the new data arrives to bring in more useful pages.
  - Pre-fetch algorithm.
Pre-fetch algorithm (2)
- Predict the next querying time series and issue a query with it.

    If no evaluation is waiting
        1) predict the next incoming value;
        2) form the predicted querying time series and issue a NN search with this series;
    Else
        same as Direct algorithm
    End if

- Only one-step-ahead prediction is needed: the error is small.
- The predicted time series is closer to the next querying time series than the previous querying time series is, so more useful pages are pre-fetched.
- Only the idle time before the new data arrives is used to evaluate the query with the predicted querying series, so no overhead is incurred.
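The claim that the predicted series is closer to the next querying series than the previous querying series is can be checked with a short Python sketch. It uses the last observed value as the one-step-ahead prediction, which is an assumed, deliberately simple predictor; the slides do not say which predictor is used. The sketch covers only the closeness argument, not page pre-fetching itself.

```python
import numpy as np

def predicted_window(stream, p, n):
    """Window expected at position p+1: the last n-1 known values followed
    by a prediction of S[p+1], here simply S[p] (hypothetical predictor)."""
    return np.append(stream[p - n + 2 : p + 1], stream[p])

def closeness(stream, p, n):
    """Distances from the predicted window and from the previous window
    to the window that actually occurs at position p+1."""
    nxt = stream[p - n + 2 : p + 2]
    prev = stream[p - n + 1 : p + 1]
    pred = predicted_window(stream, p, n)
    return np.linalg.norm(pred - nxt), np.linalg.norm(prev - nxt)

rng = np.random.default_rng(3)
stream = np.cumsum(rng.standard_normal(200))
n = 64
pairs = [closeness(stream, p, n) for p in range(n - 1, len(stream) - 1)]
```

With this predictor the predicted window differs from the next window only in its last value, while the previous window is shifted by one step in every value, so the first distance in each pair never exceeds the second.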
Evaluation cycles
[Figure: flowcharts of the evaluation cycles.]
- Direct / Traditional algorithms: search NN of S at p; if S[p+1] has come, set p := p+1 and search again; if not, wait.
- Pre-fetch algorithm: search NN of S at p; if S[p+1] has come, set p := p+1 and search again; if not, predict S[p+1], search NN of the predicted series Ŝ at p+1, then wait.
- Note: all search procedures are subject to the Limited Response Strategy.
- Note: a search procedure with a predicted series will be cancelled immediately once the new data arrives.
Experimental results
- Pattern dataset: 10^6 random walk time series, length = 128.
- Index: R-tree built on the first 6 DWT coefficients.
- Streaming time series: random walk time series with a length of 1024; number of query positions = 897.
- Stream rate: data come in every 0.5 (0.6, 0.7, ..., 4.5, 5) seconds.
- Response threshold: 0.5, 1, 1.5, ..., 4.5, 5 seconds.
Experimental results: performance comparison
Response time threshold is 2.5 seconds. The smaller the value, the better the performance.

Fast stream (interval = 0.8 seconds)
Algorithm   Drop ratio   Response time
Direct      24%          0.48 seconds
Pre-fetch   19%          0.33 seconds

Slow stream (interval = 3 seconds)
Algorithm   Drop ratio   Response time
Direct      7.5%         0.55 seconds
Pre-fetch   3%           0.12 seconds
Summary
- Investigated how to adapt the traditional approaches in order to handle continuous queries efficiently.

Algorithm     Buffer sharing             Result referring           Idle time stealing
Traditional   OS, passively              No                         No
Direct        OS, passively              previous one               No
Pre-fetch     OS / pre-fetch, actively   previous / predicted one   Yes

- Future work
  - Optimal drop strategies.
  - All or some pattern series are also streaming data.
Conclusion
- Continuous queries are emerging as an important research area.
- We introduced a new strategy for evaluating continuous queries on streaming time series.
  - Batch process with the predictions and available data
  - Verification procedure once the actual value arrives
  - Introduced drop ratio
- Performance is much better than “naïve” algorithms.
- Future work
  - Our strategy: batch processing and prediction
  - Apply the strategy to other continuous queries.
  - What if some patterns are also streaming time series?
- Website for papers: http://ise.gmu.edu/~xywang
Backup slides
Batch Process (2)
How fast is the cross correlation via FFT?

Length of FFT   Ratio* (tDIRECT / tFFT)
8               1.2
16              1.7
32              2.7
64              4.5
128             7.6
256             13.2
512             23.3
1024            41.8

*Analysis result: http://www.eptools.com/tn/T0001/PT15.HTM
Performance measure: Response time
- Response time: the average response time of the evaluations that are not dropped.
- If drops occur, the response time must be considered together with the drop ratio.

Algorithm   Drop ratio   Response time
A           20%          40 ms
B           60%          40 ms

- In this case, A is clearly the winner. In more general situations, the choice of algorithm depends on the actual application.
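Direct algorithm: one evaluation
[Figure: one evaluation over the original pattern data set and its index. In feature space, an NN search for TS finds NNf. The threshold of the near neighbor search is min{D(TS, NNf), D(TS, NN-1)}, where NN-1 is the previous result. The near neighbor search yields the candidates, which are checked in the original space to find NN.]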