High Performance Correlation Techniques For Time Series
Transcript of High Performance Correlation Techniques For Time Series
Xiaojian Zhao
Department of Computer Science
Courant Institute of Mathematical Sciences, New York University
25 Oct. 2004
Roadmap
Section 1: Introduction (Motivation; Problem Statement)
Section 2: Background (GEMINI framework; Random Projection; Grid Structure; Some Definitions; Naive Method and Yunyue's Approach)
Section 3: Sketch-based StatStream (Efficient Sketch Computation; Sketch Technique as a Filter; Parameter Selection; Grid Structure; System Integration)
Section 4: Empirical Study
Section 5: Future Work
Section 6: Conclusion
Section 1: Introduction
Motivation: stock price streams
The New York Stock Exchange (NYSE): 50,000 securities (streams); 100,000 ticks (trade and quote)
Pairs Trading, a.k.a. Correlation Trading
Query: "which pairs of stocks were correlated with a value of over 0.9 for the last three hours?"
XYZ and ABC have been correlated with a correlation of 0.95 for the last three hours. Now XYZ and ABC become less correlated as XYZ goes up and ABC goes down. They should converge back later. I will sell XYZ and buy ABC …
Online Detection of High Correlation
Correlated!
Correlated!
Why speed is important
As processors speed up, algorithmic efficiency no longer matters … one might think.
That would be true if problem sizes stayed the same, but they don't. As processors speed up, sensors improve: satellites spew out a terabyte a day, magnetic resonance imagers give higher-resolution images, etc.
Problem Statement
Detect and report correlations rapidly and accurately
Expand the algorithm into a general engine
Apply it in many practical application domains
Big Picture
Random Projection
time series 1
time series 2
time series 3
…
time series n
…
sketch 1
sketch 2
…
sketch n
…
Grid structure
Correlated pairs
Section 2: Background
GEMINI framework*
* Faloutsos, C., Ranganathan, M. & Manolopoulos, Y. (1994). Fast subsequence matching in time-series databases. In proceedings of the ACM SIGMOD Int'l Conference on Management of Data. Minneapolis, MN, May 25-27. pp 419-429.
DFT, DWT, etc
Goals of GEMINI framework
High performance: operations on synopses, such as distance computation, save time
Guarantee no false negatives: the feature space shrinks the original distances in the raw data space
Random Projection: Intuition
You are walking in a sparse forest and you are lost. You have an outdated cell phone without a GPS. You want to know if you are close to your friend. You identify yourself at 100 meters from the pointy rock and 200 meters from the giant oak, etc. If your friend is at similar distances from several of these landmarks, you might be close to one another. The sketches are the set of distances to landmarks.
How to make Random Projection*
Sketch pool: A list of random vectors drawn from stable distribution (like the landmarks)
Project the time series into the space spanned by these random vectors
The Euclidean distance (correlation) between time series is approximated by the distance between their sketches with a probabilistic guarantee.
* W. B. Johnson and J. Lindenstrauss. "Extensions of Lipschitz mappings into a Hilbert space". Contemp. Math., 26:189-206, 1984.
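As a hedged illustration (not code from the talk), a minimal random projection in pure Python: project two series onto k random ±1 vectors and compare the scaled sketch distance to the true Euclidean distance.

```python
import math
import random

def make_sketch(series, rand_vectors):
    """Sketch = inner product of the series with each random vector."""
    return [sum(r_i * x_i for r_i, x_i in zip(r, series)) for r in rand_vectors]

random.seed(7)
w, k = 256, 64                       # window length, sketch size
rand_vectors = [[random.choice((-1.0, 1.0)) for _ in range(w)] for _ in range(k)]

x = [random.gauss(0, 1) for _ in range(w)]
y = [xi + random.gauss(0, 0.3) for xi in x]   # a series close to x

true_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
sk_x = make_sketch(x, rand_vectors)
sk_y = make_sketch(y, rand_vectors)
# Dividing by k inside the root makes the sketch distance estimate true_dist.
sketch_dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(sk_x, sk_y)) / k)
print(round(true_dist, 2), round(sketch_dist, 2))
```

The two printed values agree to within a few percent, which is the probabilistic guarantee at work.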
Random Projection

Raw time series (X's and Y's current positions):
x = (x1, x2, x3, ..., xw)
y = (y1, y2, y3, ..., yw)

Random vectors (the rocks, buildings, ...):
R1 = (r11, r12, r13, ..., r1w)
R2 = (r21, r22, r23, ..., r2w)
R3 = (r31, r32, r33, ..., r3w)
R4 = (r41, r42, r43, ..., r4w)

Sketches (X's and Y's relative distances), computed by inner product:
xsk = (xsk1, xsk2, xsk3, xsk4), where xski = x · Ri
ysk = (ysk1, ysk2, ysk3, ysk4), where yski = y · Ri
Sketch Guarantees
Note: sketches do not provide approximations of individual time series windows, but they help make comparisons.

Johnson-Lindenstrauss Lemma: For any 0 < ε < 1 and any integer n, let k be a positive integer such that

k ≥ 4 (ε²/2 − ε³/3)⁻¹ ln n.

Then for any set V of n points in R^d, there is a map f: R^d → R^k such that for all u, v ∈ V,

(1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖².

Further, this map can be found in randomized polynomial time.
Sketches : Random Projection
Why do we use sketches (random projections)?
To reduce the dimensionality!
For example: if the original time series x has length 256, we may represent it with a sketch vector of length 30.
This is a first step toward removing "the curse of dimensionality".
Achlioptas's lemma: Dimitris Achlioptas proved the following.

Let P be an arbitrary set of n points in R^d, represented as an n×d matrix A. Given ε, β > 0, let

k0 = ((4 + 2β) / (ε²/2 − ε³/3)) log n.

For integer k ≥ k0, let R be a d×k random matrix with R(i,j) = r_ij, where the {r_ij} are independent random variables drawn from either one of the two probability distributions shown on the next slide.

*Idea from Dimitris Achlioptas, "Database-friendly Random Projections", Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems.
Achlioptas's lemma (continued)

Let E = (1/√k) A R, and let f: R^d → R^k map the i-th row of A to the i-th row of E. Then with probability at least 1 − n^(−β), for all u, v ∈ P,

(1 − ε)‖u − v‖² ≤ ‖f(u) − f(v)‖² ≤ (1 + ε)‖u − v‖².

The two admissible distributions are

r_ij = +1 with probability 1/2, −1 with probability 1/2

or

r_ij = √3 × (+1 with probability 1/6, 0 with probability 2/3, −1 with probability 1/6).
Definition: Sketch Distance
dsk = ‖xsk − ysk‖ = sqrt(((xsk1 − ysk1)² + (xsk2 − ysk2)² + … + (xskk − yskk)²) / k)

Note: DFT and DWT distances are analogous. For those measures, the difference between the original vectors is approximated by the difference between the first Fourier/wavelet coefficients of those vectors.
Empirical Study: Sketch Approximation

Figure: sketch distance vs. true distance ("dist") on return data, about 1,000 windows, y-axis distance 0-30. The sketch curve closely tracks the true distance.
Empirical Study: sketch distance/real distance
Figure: histograms of the factor (real distance / sketch distance) for sketch sizes 30, 80, and 1000. The factor ranges roughly from 0.88 to 1.25; larger sketch sizes concentrate the distribution around 1.
Grid Structure
x = (x1, x2, ..., xk)
Correlation and Distance
After normalization, there is a direct relationship between Euclidean distance and Pearson correlation:

dist² = 2 (1 − correlation)

where each sliding window is normalized as

X̂_i = (X_i − avg(X_sw)) / sqrt(sw · var(X_sw))
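This identity is easy to check numerically; the following illustrative snippet (not from the talk) normalizes two windows to zero mean and unit norm and verifies dist² = 2(1 − correlation):

```python
import math
import random

def normalize(series):
    """Zero mean, unit norm: subtract the window average, divide by the norm."""
    mean = sum(series) / len(series)
    centered = [v - mean for v in series]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

random.seed(1)
x = [random.gauss(0, 1) for _ in range(100)]
y = [random.gauss(0, 1) for _ in range(100)]

nx, ny = normalize(x), normalize(y)
corr = sum(a * b for a, b in zip(nx, ny))         # Pearson correlation
dist2 = sum((a - b) ** 2 for a, b in zip(nx, ny)) # squared Euclidean distance
print(round(dist2, 6), round(2 * (1 - corr), 6))  # the two values agree
```

The identity follows because the normalized vectors have unit norm, so ‖nx − ny‖² = 2 − 2 nx·ny.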
How to compute the correlation efficiently?
Goal: to find the most highly correlated stream pairs over sliding windows
Naive method
StatStream method
Our method
Naïve Approach
Space and time cost: space O(N·sw) and time O(N²·sw)
N: number of streams
sw: size of the sliding window
Let’s see Statstream approach
Definitions: Sliding window and Basic window
……Stock 1
Stock 2
Stock 3
Stock n
Sliding window
Time axis
Sliding window size=8
Basic window size=2
Basic window Time point
StatStream Idea
Use the Discrete Fourier Transform (DFT) to approximate correlation, as in the GEMINI approach discussed earlier.
Every two minutes (“basic window size”), update the DFT for each time series over the last hour (“sliding window size”)
Use a grid structure to filter out unlikely pairs
StatStream: Stream synoptic data structure
Sliding window
Basic window digests:
sum
DFT coefs
Basic window
Time point
Basic window digests:
sum
DFT coefs
Section 3: Sketch based StatStream
Problem not yet solved
DFT approximates price-like data very well, but gives a poor approximation for returns: (today's price − yesterday's price) / yesterday's price.
Returns are more like white noise, which contains all frequency components.
DFT uses only the first n (e.g. 10) coefficients to approximate the data, which is insufficient in the case of white noise.
Figure: ratio of captured energy over total energy vs. number of DFT coefficients. For random walk data the ratio approaches 1 with only a few coefficients; for white noise the ratio grows almost linearly, so the first few coefficients capture little of the energy.
Big Picture Revisited
Random Projection
time series 1
time series 2
time series 3
…
time series n
…
sketch 1
sketch 2
…
sketch n
…
Grid structure
Correlated pairs
Random Projection: inner product between Data Vector and random vector
How to compute the sketch efficiently
We will not compute the inner product at each data point because the computation is expensive. A new strategy, in joint work with Richard Cole, is used to compute the sketch. Here the random variables are drawn from:

r_ij = +1 with probability 1/2, −1 with probability 1/2
How to construct the random vector: given a time series X = (x1, x2, ...), compute its sketch for a sliding window of size sw = 12.
Partition the window into smaller basic windows of size bw = 4.
The random vector within a basic window is R = (r1, r2, r3, r4) with r_i = ±1, and a control vector b = (b1, b2, b3) is used to determine which basic windows will be multiplied by −1 or 1 (why? wait…).
A final complete random vector may look like:
(1 1 -1 1; -1 -1 1 -1; 1 1 -1 1), where R = (1 1 -1 1) and b = (1 -1 1)
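The construction above can be sketched in a few lines (illustrative helper name, not from the talk):

```python
def build_random_vector(R, b):
    """Repeat the basic-window vector R once per basic window,
    multiplying each copy by the matching control sign in b."""
    return [sign * r for sign in b for r in R]

R = [1, 1, -1, 1]        # random +/-1 vector for one basic window
b = [1, -1, 1]           # control vector, one sign per basic window
full = build_random_vector(R, b)
print(full)              # [1, 1, -1, 1, -1, -1, 1, -1, 1, 1, -1, 1]
```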
Naive algorithm and hope for improvement
There is redundancy in the second dot product given the first one. We will eliminate the repeated computation to save time.

Dot product:
r = (1 1 -1 1; -1 -1 1 -1; 1 1 -1 1)
x = (x1 x2 x3 x4; x5 x6 x7 x8; x9 x10 x11 x12)
xsk = r·x = x1+x2-x3+x4 - x5-x6+x7-x8 + x9+x10-x11+x12

With the arrival of new data points, such operations are done again:
r = (1 1 -1 1; -1 -1 1 -1; 1 1 -1 1)
x' = (x5 x6 x7 x8; x9 x10 x11 x12; x13 x14 x15 x16)
xsk' = r·x' = x5+x6-x7+x8 - x9-x10+x11-x12 + x13+x14-x15+x16
Our algorithm (Pointwise version)
Convolve each basic window X_bw with the corresponding R_bw after padding with |bw| zeros.

Animation shows convolution in action:
conv1: (1 1 -1 1 0 0 0 0) ⊗ (x1, x2, x3, x4)
conv2: (1 1 -1 1 0 0 0 0) ⊗ (x5, x6, x7, x8)
conv3: (1 1 -1 1 0 0 0 0) ⊗ (x9, x10, x11, x12)

As (1 1 -1 1) slides across (x1 x2 x3 x4), the successive partial dot products are:
x4, x4+x3, -x4+x3+x2, x4-x3+x2+x1, x3-x2+x1, x2-x1, x1
Our algorithm: example
First convolution: x4, x4+x3, x2+x3-x4, x1+x2-x3+x4, x1-x2+x3, x2-x1, x1
Second convolution: x8, x8+x7, x6+x7-x8, x5+x6-x7+x8, x5-x6+x7, x6-x5, x5
Third convolution: x12, x12+x11, x10+x11-x12, x9+x10-x11+x12, x9-x10+x11, x10-x9, x9
Our algorithm: example
First sliding window:
sk1 = (x1+x2-x3+x4), sk5 = (x5+x6-x7+x8), sk9 = (x9+x10-x11+x12)
xsk1 = (sk1, sk5, sk9) · (b1, b2, b3), with b = (1, -1, 1):
xsk1 = (x1+x2-x3+x4) - (x5+x6-x7+x8) + (x9+x10-x11+x12)

Second sliding window:
sk2 = (x2+x3-x4) + x5, sk6 = (x6+x7-x8) + x9, sk10 = (x10+x11-x12) + x13
Then sum up with b = (1, -1, 1):
xsk2 = (x2+x3-x4+x5) - (x6+x7-x8+x9) + (x10+x11-x12+x13)
Our algorithm
The projection of a sliding window is decomposed into operations over basic windows.
Each basic window is convolved with each random vector only once.
We may provide the sketches incrementally, starting from each data point.
There is no redundancy.
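A minimal sketch of the basic-window decomposition (illustrative code, function names are ours): dot each basic window with R once, combine the per-window products with the control signs in b, and check the result against a direct dot product with the full random vector.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def sketch_via_basic_windows(x, R, b):
    """Basic-window version: dot R with each basic window once,
    then combine the per-window products with the control signs b."""
    bw = len(R)
    per_window = [dot(R, x[i:i + bw]) for i in range(0, len(x), bw)]
    return dot(b, per_window)

R = [1, 1, -1, 1]
b = [1, -1, 1]
x = list(range(1, 13))               # x1..x12 as the values 1..12

# Direct computation with the full expanded random vector, for comparison.
full_r = [s * r for s in b for r in R]
assert sketch_via_basic_windows(x, R, b) == dot(full_r, x)
print(sketch_via_basic_windows(x, R, b))   # prints 12
```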
Jump by a basic window (basic window version)
Or, if consecutive data points of the time series are highly correlated, we may compute the sketch only once per basic window.
That is, we update the sketch for each time series only when the data of a complete basic window have arrived.
1 1 –1 1
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12
1 1 –1 1
1 1 –1 1
x1+x2-x3+x4 x5+x6-x7+x8 x9+x10-x11+x12
Online Version
We take the basic window version as an example. Review: to have the same baseline, we normalize the time series within each sliding window. Challenge: the normalization of the time series changes with each new basic window.
Online Version
The incremental nature of the computation requires an update of the average and variance whenever a new basic window arrives.
Do we have to recompute the normalization, and thus the sketch, whenever a new basic window arrives?
Of course not; otherwise our algorithm would degrade into the trivial computation.
Online Version
Then how? After some mathematical manipulation, we claim that we need only store and maintain the following quantities:
Sum of the whole sliding window: sum_{i=0}^{sw-1} X_i
Sum of the squares of the data in a sliding window: sum_{i=0}^{sw-1} X_i²
Sum of the whole basic window: sum_{i=0}^{bw-1} X_i
Sum of the squares of the data in a basic window: sum_{i=0}^{bw-1} X_i²
Dot product of the random vector with each basic window: X_bw · R
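A hedged illustration of this bookkeeping (class and method names are ours, not from StatStream): per-basic-window sums and sums of squares are kept in a queue, and the sliding-window totals are updated by adding the new basic window and subtracting the expired one.

```python
from collections import deque

class SlidingStats:
    """Maintain the sum and sum of squares of a sliding window,
    updated one basic window at a time (illustrative names)."""
    def __init__(self, n_basic_windows):
        self.capacity = n_basic_windows
        self.windows = deque()            # (sum, sum_sq) per basic window
        self.total, self.total_sq = 0.0, 0.0

    def push_basic_window(self, values):
        s, s2 = sum(values), sum(v * v for v in values)
        self.windows.append((s, s2))
        self.total += s
        self.total_sq += s2
        if len(self.windows) > self.capacity:   # expire the oldest window
            old_s, old_s2 = self.windows.popleft()
            self.total -= old_s
            self.total_sq -= old_s2

stats = SlidingStats(n_basic_windows=3)
for start in range(0, 16, 4):                   # feed basic windows of size 4
    stats.push_basic_window(list(range(start, start + 4)))

# The sliding window now covers the values 4..15.
print(stats.total, stats.total_sq)              # prints 114.0 1226.0
```

From these totals, the average and variance needed for normalization follow in O(1) per update.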
Performance comparison
Naïve algorithm: for each datum and random vector, O(|sw|) integer additions.
Pointwise version: asymptotically, for each datum and random vector, (1) O(|sw|/|bw|) integer additions; (2) O(log |bw|) floating point operations (using the FFT to compute convolutions).
Basic window version: asymptotically, for each basic window and random vector, (1) O(|sw|/|bw|) integer additions; (2) O(|bw|) floating point operations.
Sketch distance filter quality
We may use the sketch distance to filter out unlikely data pairs.
How accurate is it? How does it compare to the DFT and DWT distances in terms of approximation ability?
Empirical Study: sketch distance compared to DFT and DWT distances

Data length = 256
DFT: the first 14 DFT coefficients are used in the distance computation
DWT: the db2 wavelet is used, with coefficient size 16
Sketch: 64 random vectors are used
Empirical Comparison: DFT, DWT and Sketch
Figure: distance vs. data points on price data (about 1,000 windows, y-axis distance 0-50), comparing sketch, dwt, and dft against the true distance (dist).
Empirical Comparison: DFT, DWT and Sketch

Figure: distance vs. data points on return data (about 900 windows, y-axis distance 0-30), comparing dft, dwt, and sketch against the true distance (dist).
Use the sketch distance as a filter
We may compute the sketch distance

dsk = ‖xsk − ysk‖ = sqrt(((xsk1 − ysk1)² + (xsk2 − ysk2)² + … + (xskk − yskk)²) / k)

and report a pair as a candidate when dsk ≤ c · dist. Here c could be 1.2 or larger to reduce the number of false negatives. Finally, any candidate pair will be double-checked against the raw data.
Use the sketch distance as a filter
But we will not use it. Why? It is expensive: we would still have to do the pairwise comparison between each pair of stocks, which costs O(n²k), where k is the size of the sketches.
Sketch unit distance
Given sketches

xsk = (xsk1, xsk2, …, xsk8)
ysk = (ysk1, ysk2, …, ysk8)

we have the unit distances |xsk1 − ysk1|, |xsk2 − ysk2|, …, |xsk8 − ysk8|.

If a fraction f of the distance chunks satisfy |xski − yski| ≤ c · dist, we may say ‖x − y‖ ≤ dist, where
f: 30%, 40%, 50%, 60%, …
c: 0.8, 0.9, 1.1, …
Further: sketch groups
Given sketches xsk = (xsk1, xsk2, …) and ysk = (ysk1, ysk2, …), we may compute the sketch group distances

dsk_gi = ‖xsk_gi − ysk_gi‖ for groups g1, g2, g3, …

For example, with group size 4:

dsk_g1 = sqrt(((xsk1 − ysk1)² + (xsk2 − ysk2)² + (xsk3 − ysk3)² + (xsk4 − ysk4)²) / 4)

If a fraction f of the sketch groups satisfy dsk_gi ≤ c · dist, we may say ‖x − y‖ ≤ dist, where
f: 30%, 40%, 50%, 60%
c: 0.8, 0.9, 1.1
The surviving candidates are passed to the grid structure.
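A small illustrative implementation of the group filter (function names and example values are ours, not from the talk):

```python
import math

def group_distances(xsk, ysk, g):
    """Distance per sketch group of size g (normalized by the group size)."""
    dists = []
    for i in range(0, len(xsk), g):
        sq = sum((a - b) ** 2 for a, b in zip(xsk[i:i + g], ysk[i:i + g]))
        dists.append(math.sqrt(sq / g))
    return dists

def passes_filter(xsk, ysk, g, c, f, dist):
    """Candidate pair if a fraction >= f of groups lie within c * dist."""
    dists = group_distances(xsk, ysk, g)
    close = sum(1 for d in dists if d <= c * dist)
    return close / len(dists) >= f

xsk = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ysk = [1.1, 2.1, 2.9, 4.2, 9.0, 6.0]
# Two of the three groups are within c * dist, so 2/3 >= f = 0.6 passes.
print(passes_filter(xsk, ysk, g=2, c=1.0, f=0.6, dist=0.5))   # prints True
```

Raising f to 0.8 would reject this pair, since only 2/3 of the groups are close.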
Optimization in Parameter Space
Next, how do we choose the parameters g, c, f, N?
N: total number of sketches
g: group size
c: the distance factor
f: the fraction of groups required to claim that two time series are close enough
Optimization in Parameter Space
Essentially, we prepare several groups of good parameter candidates and choose the best one to apply to the practical data.
But how do we select the good candidates? Combinatorial Design (CD) and bootstrapping.
Combinatorial Design
The pairwise combinations of all the parameters. Informally: each parameter value meets each value of the other parameters in some parameter group.

P: P1, P2, P3
Q: Q1, Q2, Q3, Q4
R: R1, R2

All combinations: #P × #Q × #R = 24 groups
Combinatorial design: 12 groups*

* http://www.cs.nyu.edu/cs/faculty/shasha/papers/comb.html
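For illustration, a greedy pairwise covering for the P/Q/R example above (a generic sketch, not the cited tool): it repeatedly selects the row from the full product that covers the most parameter-value pairs not yet covered.

```python
from itertools import combinations, product

def pairwise_cover(factors):
    """Greedy 2-way covering: pick full-product rows that cover
    the most not-yet-covered parameter-value pairs."""
    names = list(factors)
    needed = {(a, va, b, vb)
              for a, b in combinations(names, 2)
              for va, vb in product(factors[a], factors[b])}
    candidates = list(product(*factors.values()))
    rows = []
    while needed:
        def gain(row):
            vals = dict(zip(names, row))
            return sum((a, vals[a], b, vals[b]) in needed
                       for a, b in combinations(names, 2))
        best = max(candidates, key=gain)
        vals = dict(zip(names, best))
        for a, b in combinations(names, 2):
            needed.discard((a, vals[a], b, vals[b]))
        rows.append(best)
    return rows

factors = {"P": ["P1", "P2", "P3"],
           "Q": ["Q1", "Q2", "Q3", "Q4"],
           "R": ["R1", "R2"]}
design = pairwise_cover(factors)
print(len(design), "rows instead of", 3 * 4 * 2)   # far fewer than 24
```

Each row covers one P-Q pair, so at least 12 rows are needed; the greedy choice stays well below the 24 full combinations.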
Combinatorial Design
Much smaller test space compared to that of all parameter combinations
We will further reduce the test space by taking advantage of continuity of recall and precision in parameter space.
Figure: precision and recall surfaces over the (c, f) parameter grid (c from 0.1 to 1.3, f from 0.1 to 0.6; values binned 0-0.2 through 0.8-1). Both vary smoothly across the parameter space.
Combinatorial Design
We employ a coarse-to-fine strategy:

N: 30, 36, 40, 60
g: 1, 2, 3
c: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1, 1.1, 1.2, 1.3
f: 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1

Once good parameters are located, their local neighbors are searched further for better solutions.
Bootstrapping: choose parameters with stable performance on both sample data and real data.
A sample set with 2,000,000 pairs.
From it, choose 20,000 samples with replacement, 100 times; compute the recall and precision each time.
Bootstrapping: from the 100 recalls and precisions, compute the mean and standard deviation. Criterion for good parameters:
Mean(recall) − std(recall) > Threshold(recall)
Mean(precision) − std(precision) > Threshold(precision)
If there are no such parameters, enlarge the resample size.
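A hedged sketch of the bootstrap criterion (synthetic labels and an illustrative threshold, not the talk's data):

```python
import random
import statistics

def bootstrap_stats(labels, rounds=100, sample_size=20000, seed=0):
    """Resample with replacement; return (mean, std) of recall over rounds.
    labels[i] is True if truly-correlated pair i was reported by the filter."""
    rng = random.Random(seed)
    recalls = []
    for _ in range(rounds):
        sample = [rng.choice(labels) for _ in range(sample_size)]
        recalls.append(sum(sample) / sample_size)
    return statistics.mean(recalls), statistics.stdev(recalls)

random.seed(3)
labels = [random.random() < 0.95 for _ in range(20000)]  # ~95% detected
mean_rec, std_rec = bootstrap_stats(labels)
print(round(mean_rec, 3), round(std_rec, 4))
print("accept:", mean_rec - std_rec > 0.9)   # the mean - std criterion
```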
Parameter Selection
| N  | f   | c    | mean_rec | std_rec | mean_prec | std_prec |
|----|-----|------|----------|---------|-----------|----------|
| 60 | 0.4 | 0.45 | 1        | 0       | 0.042     | 0.0074   |
| 60 | 0.4 | 0.46 | 1        | 0       | 0.038     | 0.0069   |
| 60 | 0.4 | 0.47 | 0.998    | 0.0054  | 0.035     | 0.0065   |
| 60 | 0.5 | 0.55 | 1        | 0       | 0.056     | 0.0093   |
| 60 | 0.5 | 0.56 | 1        | 0       | 0.052     | 0.0088   |
Preferred data distributions
The distribution of the data affects the performance of our algorithm (recall the price and return data).
The ideal data distribution satisfies

#X1 ≤ C · #X2, where X1 = {X : ‖X − Q‖ ≤ d} and X2 = {X : ‖X − Q‖ ≤ 2d}

and C is a small constant.
Generally, the less human intervention, the better; the "green" data give much better results.
Empirical Study: Various data types
Cstr: Continuous stirred tank reactor
Fortal_ecg: Cutaneous potential recordings of a pregnant woman
Steamgen: Model of a steam generator at Abbott Power Plant in Champaign, IL
Winding: Data from a test setup of an industrial winding process
Evaporator: Data from an industrial evaporator
Wind: Daily average wind speeds for 1961-1978 at 12 synoptic meteorological stations in the Republic of Ireland
Spot_exrates: The spot foreign currency exchange rates
EEG: Electroencephalogram
Empirical Study: Data distribution

Figure: distance histograms. Price data: distances spread over roughly 1-31. Return data: distances concentrated in roughly 18-25. cstr data: distances spread over roughly 4-30.
Grid Structure
Critical: the largest value, useful in the normalization to fit the grid structure. Our small lemma:

Max(sketch unit) ≤ sqrt(size of sliding window)
Grid Structure
High correlation => closeness in the vector space.
To avoid checking all pairs, we can use a grid structure and look in the neighborhood; this returns a superset of the highly correlated pairs.
The pairs labeled as "potential" are double-checked using the raw data vectors.
The pruning power: what percentage of pairs are filtered out as impossible to be close.
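A minimal grid-neighborhood candidate search in two dimensions (illustrative code; the real system hashes pairs of sketch coordinates as shown on the next slides):

```python
from collections import defaultdict
from itertools import product

def grid_candidates(points, cell):
    """Map 2-d sketch points to grid cells; candidate pairs share a cell
    or sit in adjacent cells, so no pair closer than `cell` is missed."""
    buckets = defaultdict(list)
    for idx, (px, py) in enumerate(points):
        buckets[(int(px // cell), int(py // cell))].append(idx)
    candidates = set()
    for (cx, cy), members in buckets.items():
        for dx, dy in product((-1, 0, 1), repeat=2):
            for j in buckets.get((cx + dx, cy + dy), []):
                for i in members:
                    if i < j:
                        candidates.add((i, j))
    return candidates

points = [(0.1, 0.1), (0.2, 0.15), (5.0, 5.0), (0.9, 0.9)]
cands = grid_candidates(points, cell=1.0)
print(sorted(cands))     # only pairs near the origin; point 2 is pruned
```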
Inner product with random vectors r1, r2, r3, r4, r5, r6:
X → xsk = (xsk1, xsk2, xsk3, xsk4, xsk5, xsk6)
Y → ysk = (ysk1, ysk2, ysk3, ysk4, ysk5, ysk6)
Z → zsk = (zsk1, zsk2, zsk3, zsk4, zsk5, zsk6)

Grid structure: each pair of sketch coordinates, (xsk1, xsk2), (xsk3, xsk4), (xsk5, xsk6), and likewise for ysk and zsk, is mapped into its own two-dimensional grid.
System Integration
By combining the sketch scheme with the grid structure, we can:
Reduce dimensionality
Eliminate unnecessary pair comparisons
The performance can be improved substantially.
Empirical Study: Speed

Figure: comparison of processing time. Wall clock time (seconds, 0-600) vs. number of streams (200-2200) for sketch_random, sketch_randomwalk, and exact. Sliding window = 3616, basic window = 32, sketch size = 60.
Empirical Study: Breakdown
Figure: processing time of random walk data, broken down into Detecting Correlation and Updating Sketches. Wall clock time (seconds, 0-160) vs. number of streams (500-5500).
Empirical Study: Breakdown
Figure: processing time of random data, broken down into Detecting Correlation and Updating Sketches. Wall clock time (seconds, 0-35) vs. number of streams (500-5500).
The Pruning Power of the Grid Structure
Figure: processing time (wall clock seconds, 0-3000) by data type and size for grid2, grid3, dft, and scan.
Visualization
Other applications
Cointegration Test Matching Pursuit Anomaly Detection
Cointegration Test
A linear combination of several non-stationary time series may be stationary.
Models a long-run relationship, as opposed to correlation.
StatStream may be applied to test the stationarity condition of cointegration.
Decompose a signal into a group of non-orthogonal sub-components
Test the correlation among atoms in a dictionary
Expedite the component selection
Matching Pursuit
Anomaly Detection
Measure the relative distance of each point from its nearest neighbors
StatStream may serve as a monitor, reporting points that are far from all normal points.
Conclusion
Introduction
GEMINI Framework
Random Projection
StatStream Review
Efficient Sketch Computation
Parameter Selection
Grid Structure
System Integration
Empirical Study
Future Work
Thanks a lot!
Recall and Precision
Recall = C/A, Precision = C/B, where:
A: query ball
B: returned result
C: intersection