Fast Approximate Correlation for Massive Time Series Data
![Page 1: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/1.jpg)
Abdullah Mueen UC Riverside
Suman Nath Microsoft Research
Jie Liu Microsoft Research
Fast Approximate Correlation for Massive Time Series Data
![Page 2: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/2.jpg)
Context
• Data center monitoring apps collect a large number of performance counters (e.g., CPU utilization)
  – MS DataGarage: 100-1000 counters/machine/15 sec
  – 100K machines → millions of signals, 1TB/day
  – Signals are archived on disk, over years
• General goal: Mine the data to discover trends and anomalies, over many months
• Challenge: large number of signals
![Page 3: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/3.jpg)
Our Focus: Correlation Queries
• All-pair correlation / correlation matrix
  – Is CPU utilization correlated with # of connected users?
  – Do two machines in the same rack have similar power usage?
  – Correlation matrix can reveal clusters of servers.
  – Successive matrices can reveal sudden changes in behavior.
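Given n signals of length m, find the n × n correlation matrix (Pearson correlation coefficient):

$$corr(s_i, s_j) = \frac{1}{m} \sum_{k=0}^{m-1} \frac{s_{ik} - \mu_i}{\sigma_i} \cdot \frac{s_{jk} - \mu_j}{\sigma_j}$$

where $\mu_i$ and $\sigma_i$ are the mean and standard deviation of signal $s_i$.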
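The definition above can be sketched directly in code. This is a minimal, unoptimized illustration of the exact all-pair computation, not the authors' implementation; the function names `normalize` and `correlation_matrix` are hypothetical, and the population standard deviation is assumed.

```python
import math

def normalize(signal):
    """Shift to zero mean and scale by the population standard deviation."""
    m = len(signal)
    mean = sum(signal) / m
    std = math.sqrt(sum((v - mean) ** 2 for v in signal) / m)
    if std == 0:
        raise ValueError("a constant signal has no defined correlation")
    return [(v - mean) / std for v in signal]

def correlation_matrix(signals):
    """Return the n x n Pearson correlation matrix of equal-length signals."""
    z = [normalize(s) for s in signals]
    n, m = len(z), len(z[0])
    corr = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # O(m) work per pair, O(n^2) pairs: the quadratic CPU cost.
            c = sum(a * b for a, b in zip(z[i], z[j])) / m
            corr[i][j] = corr[j][i] = c
    return corr

signals = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1]]
matrix = correlation_matrix(signals)
print(matrix)
```

Every pair costs a full pass over both signals, which is the cost the later slides try to avoid.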
![Page 4: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/4.jpg)
Challenges
• Naïve Approach:
  – Cache signals in batches that fit available memory.
  – Correlate intra-batch signals.
  – Correlate inter-batch signals.
• Problems:
  – Quadratic I/O cost
    • 40 minutes (for n=10K signals and n/32 cache size)
  – Quadratic CPU cost
    • 93 minutes (for the above setup)
  – Not easily parallelizable
[Figure: a batch and a swap slot held in memory, with signals loaded from disk]
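To see why the I/O cost is quadratic, the naïve scheme can be simulated under a simple assumed model: each batch is loaded once, and while it is cached every signal of every later batch passes through a single swap slot. The function name `naive_io_loads` and this cost model are illustrative assumptions, not taken from the slides.

```python
def naive_io_loads(n_signals, batch_size):
    """Count signal loads for batch-by-batch all-pair correlation.

    Assumed model: a batch is loaded once; while it is cached, each signal
    of every later batch is read into a single swap slot for comparison.
    """
    batches = [
        list(range(start, min(start + batch_size, n_signals)))
        for start in range(0, n_signals, batch_size)
    ]
    loads = 0
    for index, batch in enumerate(batches):
        loads += len(batch)  # bring the batch into the cache
        # One swap-slot load per signal in the batches not yet processed.
        loads += sum(len(later) for later in batches[index + 1:])
    return loads

small = naive_io_loads(8, 4)
large = naive_io_loads(16, 4)
print(small, large)
```

With the batch size fixed, doubling the number of signals more than triples the number of loads in this model.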
![Page 5: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/5.jpg)
Our Approaches: approximations
• Threshold Correlation Matrix
  – Discard correlations lower than threshold.
• Approximate Threshold Correlation Matrix
  – Output within ϵ of the true correlation.
• Boolean Threshold Correlation Matrix
  – Output 0/1 values showing correlated/uncorrelated pairs.
[Figure: four matrices: Exact Matrix; Threshold Correlation Matrix (T=0.5); Approximate Threshold Correlation Matrix (ϵ=0.04); Boolean Threshold Correlation Matrix (T=0.5)]
Equally useful in practice. Up to 18x faster.
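The three output types can be illustrated by deriving them from an exact matrix. This sketch only shows what each output looks like; the point of the talk is to produce them without computing the exact matrix first. The helper names, the use of `None` for discarded entries, and the convention that 1 marks a correlated pair are assumptions.

```python
def threshold_matrix(corr, threshold):
    """Keep entries >= threshold; discarded entries become None."""
    return [[c if c >= threshold else None for c in row] for row in corr]

def boolean_matrix(corr, threshold):
    """1 where the pair is correlated at the threshold, else 0 (assumed convention)."""
    return [[1 if c >= threshold else 0 for c in row] for row in corr]

def is_epsilon_approximation(exact, approx, eps):
    """True if every kept (non-None) entry of approx is within eps of exact."""
    return all(
        a is None or abs(a - e) <= eps
        for exact_row, approx_row in zip(exact, approx)
        for e, a in zip(exact_row, approx_row)
    )

exact = [[1.0, 0.8, 0.1], [0.8, 1.0, -0.3], [0.1, -0.3, 1.0]]
kept = threshold_matrix(exact, 0.5)
flags = boolean_matrix(exact, 0.5)
noisy = [[1.0, 0.83, None], [0.83, 1.0, None], [None, None, 1.0]]
print(kept, flags, is_epsilon_approximation(exact, noisy, 0.04))
```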
![Page 6: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/6.jpg)
Outline
• Reducing I/O Cost
  – Threshold Correlation Matrix
• Reducing CPU Cost by
  – ϵ-Approximate Correlation Matrix
  – Boolean Correlation Matrix
• Extensions
• Evaluation
![Page 7: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/7.jpg)
Reducing I/O Cost
• Challenge: The order in which signals are brought into memory is important.
  – Correlating signals across batches is expensive.
• Observation: Uncorrelated signals can often be determined (pruned) cheaply using bounds.
[Figure: batches loaded from disk into memory, with intra-batch and inter-batch comparisons]
![Page 8: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/8.jpg)
Our Approach
• Step 1: Identify likely correlated signal pairs using Discrete Fourier Transform (DFT).
• Step 2: Batch signals such that
  – Likely correlated signals are batched together.
  – Uncorrelated signals are batched separately.
  (reduces expensive inter-batch computation)
![Page 9: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/9.jpg)
Step 1: Pruning using DFT
• Correlation is related to Euclidean distance.
  – If $\hat{x}_i = \frac{x_i - \mu_x}{\sigma_x}$ and $\hat{y}_i = \frac{y_i - \mu_y}{\sigma_y}$, then $corr(x,y) = 1 - \frac{1}{2} d^2(\hat{x}, \hat{y})$ (taking $\hat{x}$ and $\hat{y}$ scaled to unit length)
• DFT preserves Euclidean distance.
• Any prefix of DFT can give a lower bound on the distance.
  – Use a small prefix to identify if a signal pair is likely correlated or definitely uncorrelated
  – Smooth signals need very few DFT coefficients
[Figure: Pruning Matrix]
(Agrawal et al., 1993; Zhu et al., 2002)
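The pruning test can be sketched as follows. A lower bound on the distance from a few DFT coefficients gives an upper bound on the correlation, and a pair whose upper bound falls below the threshold is definitely uncorrelated. This is an illustrative sketch with a naive O(m·k) transform; the function names are hypothetical, signals are assumed to be real-valued and non-constant, and the prefix length k is assumed to be less than m/2.

```python
import cmath
import math

def unit_normalize(signal):
    """Zero mean and unit Euclidean norm, so corr(x, y) = 1 - d^2 / 2."""
    mean = sum(signal) / len(signal)
    centered = [v - mean for v in signal]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def dft_prefix(signal, k):
    """Coefficients 1..k of the unitary DFT (coefficient 0 is 0 after centering)."""
    m = len(signal)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * f * t / m) for t, v in enumerate(signal))
        / math.sqrt(m)
        for f in range(1, k + 1)
    ]

def correlation_upper_bound(prefix_x, prefix_y):
    """Upper bound on corr from a DFT prefix; valid while k < m / 2.

    For real signals each kept coefficient has a conjugate mirror, so the
    prefix accounts for twice its own squared distance: d^2 >= 2 * partial.
    """
    partial = sum(abs(a - b) ** 2 for a, b in zip(prefix_x, prefix_y))
    return 1.0 - partial

def likely_correlated(x, y, k, threshold):
    """False means the pair is definitely below the correlation threshold."""
    bound = correlation_upper_bound(
        dft_prefix(unit_normalize(x), k), dft_prefix(unit_normalize(y), k)
    )
    return bound >= threshold

m = 16
x = [math.sin(2 * math.pi * t / m) for t in range(m)]
y = [math.cos(2 * math.pi * t / m) for t in range(m)]
print(likely_correlated(x, y, 2, 0.5), likely_correlated(x, [2 * v + 1 for v in x], 2, 0.5))
```

A `True` result is only a candidate; the pair still has to be verified, whereas a `False` result needs no further work.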
![Page 10: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/10.jpg)
Step 2: Min-cut Partitioning
[Figure: the pruning matrix is drawn as a likely-correlated graph over signals 1-12; panels show the Pruning matrix, the Likely Correlated Graph with a Cut, Recursive Bi-partitioning, and Merging smaller batches]
The node-capacitated graph partitioning problem is NP-complete.
(Ferreira et al.,1998)
(Fiduccia and Mattheyses, 1982; Wang et al., 2000)
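Recursive bi-partitioning can be sketched with a much simpler heuristic than the ones cited above. The code below uses a plain swap-based local search as a stand-in, so it is not the Fiduccia-Mattheyses algorithm and not the authors' method; the function names are hypothetical and the step that merges small batches is omitted.

```python
from itertools import combinations

def cut_size(edges, part_a):
    """Number of edges with exactly one endpoint in part_a."""
    return sum(1 for u, v in edges if (u in part_a) != (v in part_a))

def bipartition(nodes, edges):
    """Split nodes into two near-equal halves, reducing the cut by swaps."""
    ordered = sorted(nodes)
    members = set(ordered)
    edges = [(u, v) for u, v in edges if u in members and v in members]
    half = len(ordered) // 2
    a, b = set(ordered[:half]), set(ordered[half:])
    improved = True
    while improved:
        improved = False
        best = cut_size(edges, a)
        for u in sorted(a):
            for v in sorted(b):
                trial = (a - {u}) | {v}
                if cut_size(edges, trial) < best:
                    # Accept the first swap that strictly lowers the cut.
                    a, b = trial, (b - {v}) | {u}
                    improved = True
                    break
            if improved:
                break
    return a, b

def partition(nodes, edges, capacity):
    """Recursively bisect until every batch holds at most `capacity` signals."""
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    if len(nodes) <= capacity:
        return [set(nodes)]
    a, b = bipartition(nodes, edges)
    return partition(a, edges, capacity) + partition(b, edges, capacity)

# Two groups of mutually likely-correlated signals joined by one edge.
edges = list(combinations([1, 3, 5, 7], 2)) + list(combinations([2, 4, 6, 8], 2)) + [(1, 2)]
batches = partition(range(1, 9), edges, 4)
print(batches)
```

Each accepted swap strictly lowers the cut, so the search terminates, but like any local search it can stop at a local minimum.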
![Page 11: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/11.jpg)
Example
• 8 signals
• 5 signals can fit in memory
• Maximum batch size is 4
• 1 swap slot
[Figure: three panels showing the cached batch, the batch loads, and the order of comparisons for signals 1-8; labels recovered in slide order: "11 I/O", "Pruning by DFT", "9 I/O", "Min-cut Partitioning", "12 I/O"]
![Page 12: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/12.jpg)
Outline
• Reducing I/O Cost
  – Threshold Correlation Matrix
• Reducing CPU Cost by
  – ϵ-Approximate Correlation Matrix (1)
  – Boolean Correlation Matrix (2)
• Extensions
• Evaluation
![Page 13: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/13.jpg)
ϵ - Approximate Correlation(1)
• Result: given ϵ, compute k
• Smooth signal ⇒ k is small for small ϵ
[Figure: two signals of length m are each transformed by DFT and truncated to length k; correlation is computed in the frequency domain]
Reduce scanning cost from O(m) to O(k)
General approach: unlike previous work (StatStream: Zhu et al., 2002)
  – Exact correlation $c_{i,j}$
  – Approx. corr. $c_{i,j} \pm \epsilon$
![Page 14: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/14.jpg)
Example: ϵ=0.1
$$corr(x,y) = 1 - \frac{1}{2} \sum_{i=0}^{m-1} (\hat{x}_i - \hat{y}_i)^2 = 1 - \frac{1}{2} \sum_{i=0}^{m-1} |\hat{X}_i - \hat{Y}_i|^2$$

$$corr(x,y) - 0.1 \le 1 - \sum_{i=0}^{k-1} |\hat{X}_i - \hat{Y}_i|^2 \le corr(x,y) + 0.1$$

$k = \max(k_x, k_y) = 18$. For most signals $k \ll m$.
[Figure: Signal x and Signal y (length 1024) with their DFT magnitudes, DFT X and DFT Y, plotted against prefix length; $k_x = 14$, $k_y = 18$; prefix length for time domain: 1014, 1017]
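The choice of k and the resulting approximation can be sketched as follows, assuming k is picked per signal so that the retained coefficients hold at least 1 − ϵ/2 of the energy, and the pair uses the larger of the two values. This sketch counts each retained coefficient together with its conjugate mirror, which is an assumption about the convention; the function names are hypothetical and the transform is a naive O(m²) DFT for real-valued, non-constant signals.

```python
import cmath
import math

def unit_normalize(signal):
    """Zero mean and unit Euclidean norm (total spectral energy 1)."""
    mean = sum(signal) / len(signal)
    centered = [v - mean for v in signal]
    norm = math.sqrt(sum(v * v for v in centered))
    return [v / norm for v in centered]

def unitary_dft(signal):
    """Naive O(m^2) DFT scaled by 1/sqrt(m) so that energy is preserved."""
    m = len(signal)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * f * t / m) for t, v in enumerate(signal))
        / math.sqrt(m)
        for f in range(m)
    ]

def kept_indices(m, k):
    """Frequencies 1..k plus their conjugate mirrors m-k..m-1."""
    return sorted(set(range(1, k + 1)) | set(range(m - k, m)))

def prefix_length(spectrum, eps):
    """Smallest k whose kept coefficients hold energy >= 1 - eps / 2."""
    m = len(spectrum)
    for k in range(1, m // 2 + 1):
        energy = sum(abs(spectrum[f]) ** 2 for f in kept_indices(m, k))
        if energy >= 1 - eps / 2:
            return k
    return m // 2

def approx_correlation(spec_x, spec_y, k):
    """1 - d^2 / 2 with d^2 summed over the kept coefficients only."""
    kept = kept_indices(len(spec_x), k)
    return 1.0 - 0.5 * sum(abs(spec_x[f] - spec_y[f]) ** 2 for f in kept)

m, eps = 32, 0.1
w = 2 * math.pi / m
x = unit_normalize([math.sin(w * t) + 0.3 * math.sin(3 * w * t)
                    + 0.05 * math.sin(9 * w * t) for t in range(m)])
y = unit_normalize([math.cos(w * t) + 0.2 * math.sin(2 * w * t)
                    + 0.05 * math.cos(11 * w * t) for t in range(m)])
X, Y = unitary_dft(x), unitary_dft(y)
k = max(prefix_length(X, eps), prefix_length(Y, eps))
exact = sum(a * b for a, b in zip(x, y))
approx = approx_correlation(X, Y, k)
print(k, exact, approx)
```

In this sketch the dropped coefficients can only lower the distance, so the approximation never falls below the exact correlation and exceeds it by at most ϵ. In practice only coefficients 1..k would be stored, since the mirrored ones are their conjugates.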
![Page 15: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/15.jpg)
Boolean Correlation(2)
• Correlation threshold T ⇒ distance threshold θ
• If UB(x,y) ≤ θ then corr(x,y) ≥ T
• If LB(x,y) > θ then corr(x,y) < T
• Otherwise, compute correlation
• Result: Compute bounds using the triangle inequality and use dynamic programming.
Idea: Use distance bounds instead of exact distance
O(1) point arithmetic instead of O(m) vector arithmetic
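The decision rule can be sketched with bounds obtained through a single reference signal r, using the triangle inequality. This is an illustrative sketch only: the choice of reference and the dynamic-programming step from the slide are not shown, the function names are hypothetical, and distances are assumed to be between unit-normalized signals so that corr = 1 − d²/2.

```python
import math

def distance_threshold(corr_threshold):
    """corr >= T is equivalent to d <= sqrt(2 * (1 - T)) for unit-normalized signals."""
    return math.sqrt(2.0 * (1.0 - corr_threshold))

def classify_pair(d_xr, d_yr, corr_threshold, exact_distance):
    """Return 1 (correlated) or 0, using bounds through a reference signal r.

    d_xr and d_yr are the known distances from x and y to r;
    exact_distance is a callable invoked only when the bounds are inconclusive.
    """
    theta = distance_threshold(corr_threshold)
    upper = d_xr + d_yr        # triangle inequality: d(x, y) <= d(x, r) + d(r, y)
    lower = abs(d_xr - d_yr)   # reverse triangle inequality
    if upper <= theta:
        return 1
    if lower > theta:
        return 0
    return 1 if exact_distance() <= theta else 0

calls = []

def expensive_distance():
    """Stand-in for an O(m) distance computation; records that it ran."""
    calls.append(1)
    return 0.95

results = [
    classify_pair(0.3, 0.4, 0.5, expensive_distance),  # decided by the upper bound
    classify_pair(0.1, 1.5, 0.5, expensive_distance),  # decided by the lower bound
    classify_pair(0.8, 0.9, 0.5, expensive_distance),  # bounds inconclusive
]
print(results, len(calls))
```

Only the third pair triggers the full distance computation; the first two are settled with constant-time arithmetic on stored distances.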
![Page 16: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/16.jpg)
Outline
• Reducing I/O Cost
  – Threshold Correlation Matrix
• Reducing CPU Cost by
  – ϵ-Approximate Correlation Matrix
  – Boolean Correlation Matrix
• Extensions
• Evaluation
![Page 17: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/17.jpg)
Extensions
• Negative Correlation
  – For every pair (x,y) also consider (x,-y).
  – No need to compute DFT for -y. DFT(-y) = -DFT(y)
• Lagged Correlation
  – Maximum correlation for a set of possible lags [Sakurai’05].
  – Result: DFT of prefix/suffix of x from prefix of DFT(x).
[Figure: Signal x with Lag = 4, relating Prefix(DFT(x)) to DFT(Prefix(x))]
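For the negative-correlation case, a short derivation shows why no extra transform or distance computation is needed. It assumes the unit-length normalization used for the distance identity $corr(x,y) = 1 - \frac{1}{2} d^2(\hat{x}, \hat{y})$, and uses the fact that negating y negates its normalized form.

```latex
\begin{align*}
corr(x, -y) &= -\,corr(x, y) \\
d^2(\hat{x}, -\hat{y}) &= 2 - 2\,corr(x, -y) = 2 + 2\,corr(x, y) \\
                       &= 4 - d^2(\hat{x}, \hat{y})
\end{align*}
```

So the distance for the pair (x, -y) follows from the distance already computed for (x, y).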
![Page 18: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/18.jpg)
I/O Speedup
• Almost linear growth in I/O cost as we increase (data size : cache size)
• Up to 3.5X speedup over random partitioning.
• We used 4 datasets
  – DC1: # of TCP connections established in a server.
  – DC2: Processor utilization of a server.
  – Chlorine: The chlorine level in a water distribution network.
  – RW: Synthetic random walks modeling the behavior of a stock price.
![Page 19: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/19.jpg)
CPU Speedup
[Figure: three speedup plots with y-axis labels Speedup (AT/Aϵ), Speedup (A/AT), and Speedup (AT/Aϵ)]
• Pruning in frequency, correlation in time domain [Zhu, VLDB02]
• Approximate correlation in frequency domain.
Up to 18X speedup over time domain correlations
![Page 20: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/20.jpg)
Related Work
• Exact correlations
  – StatStream (Zhu et al., 2002) uses DFT to prune uncorrelated pairs
  – Exact correlation in time domain
• Approximate correlations
  – HierarchyScan (Li et al., 1996)
  – False negatives without error bound
• Above works do not consider I/O optimization
![Page 21: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/21.jpg)
Conclusion
• Bounded approximation techniques for fast correlation queries
  • Threshold Correlation Matrix, up to 3.5X faster
  • ϵ-Approximate Correlation Matrix, up to 18X faster
  • Boolean Correlation Matrix, up to 11X faster
• Based on computational shortcuts in DFT
• Can be extended to negative and lagged correlation with very low error
![Page 22: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/22.jpg)
Thank You
![Page 23: Fast Approximate Correlation for Massive Time Series Data](https://reader035.fdocuments.in/reader035/viewer/2022062321/56812de8550346895d93439b/html5/thumbnails/23.jpg)
ϵ - Approximate Correlation(1)
• Result: given ϵ, compute k
• Smooth signal ⇒ k is small for small ϵ
[Figure: two signals of length m are each transformed by DFT and truncated to length k; correlation is computed in the frequency domain]
Reduce scanning cost from O(m) to O(k)
General approach: unlike previous work
  – Exact correlation $c_{i,j}$: total energy $\sum_{i=0}^{m-1} |\hat{X}_i|^2 = 1.0$
  – Approx. corr. $c_{i,j} \pm \epsilon$: selected energy $\sum_{i=0}^{k-1} |\hat{X}_i|^2 = (1 - \epsilon/2)$