Post on 20-Jul-2015
Mining Group Correlations over Data Streams
2011/12/02
Publication/ICCSE 2011
Presenter/Yuan-Chung Chang
Outline
Introduction
Related Work
Fundamental Theory
MGDS algorithms
Experiment
Conclusions
Introduction
Mining correlations over data streams has attracted a lot of attention recently.
Group correlation analysis over data streams has received relatively little study.
Existing work mainly focuses on a single time window and has high space and time complexity.
This paper proposes an online canonical correlation analysis algorithm called MGDS (Mining Group Data Streams). The MGDS algorithm dynamically maintains a few statistics computed from the raw data and uses them to calculate correlation.
Related Work
The correlation analysis of multidimensional or multiple data streams
StreamSVD algorithm (2003)
StreamSVD samples the observed values based on low-rank approximation theory and uses SVD to analyze the correlation of streams.
StreamCCA algorithm (2006)
StreamCCA applies canonical correlation analysis (CCA) from classical statistical theory to the field of data streams.
StreamSVD and StreamCCA both need to keep the whole history of stream values and cannot compute the correlation over a changing time range.
Fundamental Theory (1/7)
Definition: Correlation coefficients
$x_{ji}$ and $y_{jk}$ denote the $j$th values of the $i$th and $k$th data streams $X_i$ and $Y_k$, respectively.
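The formula on this slide did not survive in the text, so the following is only a minimal sketch that assumes the usual sample Pearson correlation coefficient between two streams. The function name `pearson` and the plain-list representation of a stream are illustrative choices, not taken from the paper.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length value sequences.

    Assumption: this is the standard definition; the slide's own formula
    is not available in the extracted text.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centered cross-product and centered sums of squares.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant stream")
    return sxy / math.sqrt(sxx * syy)
```

A value near 1 or -1 indicates a strong linear relationship between the two streams, and a value near 0 indicates little linear relationship.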
Fundamental Theory (2/7)
Definition: Multidimensional data stream
A multidimensional data stream can be viewed as a mapping of several one-dimensional data streams.
e.g. at time $j$, the values of the $N$ streams are $x_{j1}, x_{j2}, \dots, x_{ji}, \dots, x_{jN}$; the value of the corresponding multidimensional data stream is then $[x_{j1}, x_{j2}, \dots, x_{ji}, \dots, x_{jN}]$.
Fundamental Theory (3/7)
Definition: Base window
Suppose there are $N$ streams and the current time is $t$. Within a time window of length $w$, the observed values $[x_{ti}, \dots, x_{(t+w-1)i}]$ ($1 \le i \le N$) of the $N$ streams form a base window.
Definition: Correlation query window
A set of successive base windows.
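The two window definitions can be illustrated with a small sketch for a single stream. It assumes that base windows are consecutive and non-overlapping and that an incomplete trailing window is ignored; the slides do not state either point, and the function names are hypothetical.

```python
def split_into_base_windows(values, w):
    """Cut one stream into consecutive base windows of w values each.

    Assumption: windows do not overlap and a trailing partial window is dropped.
    """
    if w <= 0:
        raise ValueError("window size must be positive")
    return [values[start:start + w]
            for start in range(0, len(values) - w + 1, w)]

def query_window(base_windows, first, count):
    """Return the values covered by `count` successive base windows,
    starting at base window index `first`."""
    selected = base_windows[first:first + count]
    if count <= 0 or len(selected) < count:
        raise ValueError("query window exceeds the available base windows")
    return [v for window in selected for v in window]

# Usage: ten observations, base windows of three values each.
windows = split_into_base_windows(list(range(10)), 3)
recent = query_window(windows, 1, 2)  # the last two complete base windows
```

Choosing `first` and `count` at query time is what lets the analysis range vary instead of being fixed to a single window.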
Fundamental Theory (4/7)
Principle of CCA
Canonical correlation analysis is a way of measuring the linear relationship between two sets of data streams.
Canonical variable $U_i$, generated from $X$, represents most of the information of $X$.
Canonical variable $V_i$, generated from $Y$, represents most of the information of $Y$.
$a_i^T$ and $b_i^T$ are linear transformations; they give the weights of the different dimensions of $X$ and $Y$ in $U_i$ and $V_i$.
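Fundamental Theory (5/7)
Principle of CCA
Theorem 1
Suppose $p \le q$ and let the random vectors $X_{p \times 1}$ and $Y_{q \times 1}$ have $\mathrm{Cov}(X) = \Sigma_{11}$ ($p \times p$), $\mathrm{Cov}(Y) = \Sigma_{22}$ ($q \times q$), and $\mathrm{Cov}(X, Y) = \Sigma_{12}$ ($p \times q$), where $\Sigma$ has full rank.
For coefficient vectors $a_{p \times 1}$ and $b_{q \times 1}$, form the linear combinations $U = a^T X$ and $V = b^T Y$.
The first canonical variate pair attains $\max_{a,b} \mathrm{Corr}(U, V) = \rho_1$, where $U_1 = e_1^T \Sigma_{11}^{-1/2} X$ and $V_1 = f_1^T \Sigma_{22}^{-1/2} Y$.
The $k$th pair of canonical variates attains $\max \mathrm{Corr}(U_k, V_k) = \rho_k$ among linear combinations uncorrelated with the preceding pairs, where $U_k = e_k^T \Sigma_{11}^{-1/2} X$ and $V_k = f_k^T \Sigma_{22}^{-1/2} Y$.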
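Fundamental Theory (6/7)
Principle of CCA
Theorem 1 (continued)
Here $\rho_1^2 \ge \rho_2^2 \ge \dots \ge \rho_p^2$ are the eigenvalues of $\Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2}$, and $e_1, e_2, \dots, e_p$ are the associated ($p \times 1$) eigenvectors. Likewise, $f_1, f_2, \dots, f_p$ are the ($q \times 1$) eigenvectors of $\Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$ associated with the same eigenvalues.
If the sample covariance of the normalized observed values of the streams is $S$, then, to obtain the correlation of two sets of data streams and identify the leading data streams in the correlation analysis, $\Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2}$ in Theorem 1 is replaced by $S_{11}^{-1/2} S_{12} S_{22}^{-1} S_{21} S_{11}^{-1/2}$, and the correlation analysis is carried out on these sample quantities.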
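As a small sanity check of Theorem 1 with sample covariances, consider the special case $p = 1$, $q = 2$. The matrix $S_{11}^{-1/2} S_{12} S_{22}^{-1} S_{21} S_{11}^{-1/2}$ is then a single number, so $\rho_1^2$ can be computed without an eigen-solver. The sketch below covers only this special case, uses illustrative function names, and is not the paper's analysis algorithm.

```python
import math

def cov(u, v):
    """Unbiased sample covariance of two equal-length sequences."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (n - 1)

def first_canonical_corr_p1_q2(x, y1, y2):
    """First canonical correlation between group X = {x} and group Y = {y1, y2}.

    With p = 1 the matrix S11^{-1/2} S12 S22^{-1} S21 S11^{-1/2} is 1x1,
    so rho_1^2 = S12 S22^{-1} S21 / S11.
    """
    s11 = cov(x, x)
    s12 = (cov(x, y1), cov(x, y2))
    a, b, d = cov(y1, y1), cov(y1, y2), cov(y2, y2)  # entries of S22
    det = a * d - b * b
    if s11 <= 0 or det <= 0:
        raise ValueError("covariance blocks must have full rank")
    # w = S22^{-1} S21, using the closed-form inverse of a 2x2 matrix.
    w1 = (d * s12[0] - b * s12[1]) / det
    w2 = (-b * s12[0] + a * s12[1]) / det
    rho_sq = (s12[0] * w1 + s12[1] * w2) / s11
    # Clamp to [0, 1] to absorb floating-point rounding.
    return math.sqrt(min(max(rho_sq, 0.0), 1.0))
```

When `x` is an exact linear combination of `y1` and `y2` the result is 1, and in general it is at least as large as the absolute Pearson correlation between `x` and either stream of the other group. For general $p$ and $q$ a symmetric eigen-decomposition routine is needed.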
Fundamental Theory (7/7)
Principle of CCA
We transform the sample covariance of any two data streams to show which statistics need to be kept, as follows.
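The derivation on this slide is not reproduced in the text. One standard way to obtain such statistics, shown here only as a sketch under that assumption, is to keep a count and five sums per base window. The sums of several base windows add up, so the covariance and correlation over any query window follow without revisiting raw values. The names `base_stats`, `merge`, and `correlation_from_stats` are hypothetical.

```python
import math

KEYS = ("n", "sx", "sy", "sxx", "syy", "sxy")

def base_stats(xs, ys):
    """Summarize one base window of two aligned streams by a count and five sums."""
    return {
        "n": len(xs),
        "sx": sum(xs),
        "sy": sum(ys),
        "sxx": sum(a * a for a in xs),
        "syy": sum(b * b for b in ys),
        "sxy": sum(a * b for a, b in zip(xs, ys)),
    }

def merge(stats_list):
    """Combine the statistics of successive base windows by adding them."""
    total = dict.fromkeys(KEYS, 0)
    for stats in stats_list:
        for key in KEYS:
            total[key] += stats[key]
    return total

def correlation_from_stats(s):
    """Pearson correlation computed from the merged sums only."""
    n = s["n"]
    cxy = s["sxy"] - s["sx"] * s["sy"] / n
    cxx = s["sxx"] - s["sx"] ** 2 / n
    cyy = s["syy"] - s["sy"] ** 2 / n
    if cxx <= 0 or cyy <= 0:
        raise ValueError("correlation is undefined for a constant stream")
    return cxy / math.sqrt(cxx * cyy)

# Usage: two base windows of three values each, merged into one query window.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]
per_window = [base_stats(x[i:i + 3], y[i:i + 3]) for i in (0, 3)]
r = correlation_from_stats(merge(per_window))
```

Sum-based formulas of this kind can lose precision when the values are large relative to their spread, which is one reason to normalize the observed values first.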
MGDS algorithms
Algorithm 1: Generate base statistics
Algorithm 2: Analysis algorithm
Experiment (1/6)
Computer platform
Intel(R) Core(TM)2 Quad CPU Q8400 / 2.66 GHz / 3 GB / 250 GB
OS is Windows XP SP3
Matlab 2007b is used to run the programs and generate the synthetic data sets
Experiment (2/6)
Data set
Linear data set with noise
The values of every stream are taken from linearly increasing data on the interval [0, 50000], with added random values generated by $N(0, 3^2)$.
We expect a high correlation.
Gauss data set
The values of the streams in groups X and Y follow $N(50, 15^2)$ and $N(100, 25^2)$, respectively.
We expect a low correlation.
Real stock data set
Data for 15 stocks from Shenzhen Securities and Shanghai Securities, from Jan. 2005 to Dec. 2010.
We expect a very high correlation.
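The two synthetic data sets could be generated along the following lines. The authors used Matlab, so this Python version is only an illustrative sketch. The stream length, the fixed seed, and the function names are assumptions, and the noise and Gaussian parameters are read as standard deviations 3, 15, and 25.

```python
import random

def linear_noisy_stream(length, rng, low=0.0, high=50000.0, noise_sd=3.0):
    """Linearly increasing values on [low, high] plus N(0, noise_sd^2) noise."""
    step = (high - low) / (length - 1)
    return [low + i * step + rng.gauss(0.0, noise_sd) for i in range(length)]

def gauss_stream(length, rng, mean, sd):
    """Independent draws from N(mean, sd^2)."""
    return [rng.gauss(mean, sd) for _ in range(length)]

rng = random.Random(0)      # hypothetical fixed seed, for repeatability only
length = 1000               # hypothetical stream length; the slides give none

linear_x = linear_noisy_stream(length, rng)
linear_y = linear_noisy_stream(length, rng)
gauss_x = gauss_stream(length, rng, 50.0, 15.0)    # group X: N(50, 15^2)
gauss_y = gauss_stream(length, rng, 100.0, 25.0)   # group Y: N(100, 25^2)
```

Streams of the linear data set share the same trend, which is why a high correlation is expected, whereas the Gaussian streams are drawn independently.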
Experiment (3/6)
The relationship between the number of streams and the time used per analysis
We use the Gauss data set in this experiment to compare the time that MGDS and the naive algorithm need for each correlation analysis over the values of (p+q) streams.
The size of the base window is 500 values, the number of correlation query windows is 30, and the numbers of streams are {p=40,60,80,100,120; q=60,90,120,150,180}.
The naive algorithm calculates the high-level statistics from the raw values instead of from the base statistics.
Experiment (4/6)
Experiment (5/6)
The relationship between the size of the base window and the correlation coefficient
We use the Gauss data set, the linear data set, and the real stock data set to evaluate the effectiveness of MGDS as the size of the base window changes.
The number of streams is p=q=15, the number of correlation query windows is 5, and the size of the base window varies over {W=50,100,150,200,250}.
Experiment (6/6)
Conclusions
This paper proposes the MGDS algorithm, which is built on base windows.
Unlike other algorithms, MGDS does not need to keep all stream values: it compresses the original values into statistics and bases the correlation analysis only on these statistics, which greatly reduces space and time complexity.
Unlike other algorithms, the correlation analysis range of MGDS is not limited to a single window but can change flexibly depending on requirements, and the results of the MGDS algorithm are accurate.
Thank you for listening
Q & A