Mining group correlations over data streams


2011/12/02

Publication/ICCSE 2011

Presenter/Yuan-Chung Chang

Outline

Introduction

Related Work

Fundamental Theory

MGDS algorithms

Experiment

Conclusions


Introduction

Mining correlations over streams has attracted a lot of attention recently.

Work on group correlation analysis over data streams is relatively scarce.

The existing literature mainly focuses on a single time window and incurs large space and time complexity.

This paper proposes an online canonical correlation analysis algorithm called MGDS (Mining Group Data Streams). The MGDS algorithm dynamically maintains a few statistics from the raw data to calculate the correlation.

Related Work

The correlation analysis of multidimensional or multiple data streams

StreamSVD algorithm (2003)

StreamSVD samples the observed values based on low-rank approximation theory and uses SVD theory to analyze the correlation of streams.

StreamCCA algorithm (2006)

StreamCCA applies canonical correlation analysis (CCA) from classical statistical theory to the field of data streams.

StreamSVD and StreamCCA both need to keep the whole history of stream values and cannot compute the correlation over a changing time range.


Fundamental Theory (1/7)

Definition: Correlation coefficient

$x_{ji}$ and $y_{jk}$ denote the $j$-th values of the $i$-th and $k$-th data streams $X_i$ and $Y_k$, respectively.
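The defining formula is not present in this transcript. As a minimal sketch, assuming the definition is the standard Pearson correlation coefficient between the values of $X_i$ and $Y_k$, it could be computed as follows. The function name `pearson` and the sample values are illustrative only and do not come from the slides.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long value sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Illustrative values for one pair of streams.
x_i = [1.0, 2.0, 3.0, 4.0]
y_k = [2.0, 4.0, 6.0, 8.5]
r = pearson(x_i, y_k)
```

This direct form needs all raw values, which is what a streaming method tries to avoid.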



Definition: Multidimensional data stream

A multidimensional data stream can be viewed as a mapping of one-dimensional data streams.

E.g., at time $j$ the values of the $N$ streams are $x_{j1}, x_{j2}, \ldots, x_{ji}, \ldots, x_{jN}$; the value of the corresponding multidimensional data stream is then $[x_{j1}, x_{j2}, \ldots, x_{ji}, \ldots, x_{jN}]$.



Definition: Base window

Suppose there are $N$ streams and the current time is $t$. Within a time window of length $w$, the observed values $[x_{ti}, \ldots, x_{(t+w-1)i}]$ ($1 \le i \le N$) of the $N$ streams constitute a base window.

Definition: Correlation query window

A set of successive base windows.
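The two window definitions can be illustrated with a small sketch. It assumes that values arrive as one row $[x_{j1}, \ldots, x_{jN}]$ per time step and that base windows do not overlap; the slides do not state either point. The helper names `split_into_base_windows` and `query_window` are hypothetical.

```python
def split_into_base_windows(rows, w):
    """Cut time-ordered rows into consecutive base windows of w rows each.

    Each row holds the values of all N streams at one time step.
    A trailing partial window is dropped (an assumption of this sketch).
    """
    full = len(rows) // w
    return [rows[k * w:(k + 1) * w] for k in range(full)]

def query_window(base_windows, m):
    """Return the m most recent base windows as one correlation query window."""
    if m <= 0:
        return []
    return base_windows[-m:]

# Toy data: N = 2 streams, 10 time steps, base window size w = 3.
rows = [[j, 2 * j] for j in range(10)]
base = split_into_base_windows(rows, 3)
recent = query_window(base, 2)
```

Because a query window is just a run of base windows, its length can be changed by choosing a different `m` without touching the raw values again.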



Principle of CCA

Correlation analysis is a way of measuring the linear relationship between two sets of data streams.

Canonical variable $U_i$ generated by $X$ represents most of the information of $X$.

Canonical variable $V_i$ generated by $Y$ represents most of the information of $Y$.

$a_i^T$ and $b_i^T$, which represent the weights of the different dimensions of $U_i$ and $V_i$ in the correlation, are linear transformations.



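Principle of CCA

Theorem 1

Suppose $p \le q$ and let the random vectors $X_{p \times 1}$ and $Y_{q \times 1}$ have $\mathrm{Cov}(X) = \Sigma_{11}$ ($p \times p$), $\mathrm{Cov}(Y) = \Sigma_{22}$ ($q \times q$), and $\mathrm{Cov}(X, Y) = \Sigma_{12}$ ($p \times q$), where $\Sigma$ has full rank.

For coefficient vectors $a_{p \times 1}$ and $b_{q \times 1}$, form the linear combinations $U = a^T X$ and $V = b^T Y$.

The first pair of canonical variates: $\max \mathrm{Corr}(U_1, V_1) = \rho_1$, where $U_1 = e_1^T \Sigma_{11}^{-1/2} X$ and $V_1 = f_1^T \Sigma_{22}^{-1/2} Y$.

The $k$-th pair of canonical variates: $\max \mathrm{Corr}(U_k, V_k) = \rho_k$, where $U_k = e_k^T \Sigma_{11}^{-1/2} X$ and $V_k = f_k^T \Sigma_{22}^{-1/2} Y$.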



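Principle of CCA

Theorem 1 (continued)

Here $\rho_1^2 \ge \rho_2^2 \ge \ldots \ge \rho_p^2$ are the eigenvalues of $\Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2}$, and $e_1, e_2, \ldots, e_p$ are the associated ($p \times 1$) eigenvectors.

If $S$ is the sample covariance of the normalized observed values of the streams, then to obtain the correlation between two sets of data streams and to identify the leading data streams in the correlation analysis, $\Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2}$ in Theorem 1 is replaced by $S_{11}^{-1/2} S_{12} S_{22}^{-1} S_{21} S_{11}^{-1/2}$, and the correlation analysis is carried out on this sample matrix.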
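The sample version of Theorem 1 can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: it assumes NumPy is available, uses raw values directly, and the names `inv_sqrt` and `canonical_correlations` as well as the test data are made up for the example.

```python
import numpy as np

def inv_sqrt(sym):
    """Inverse square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(sym)
    return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

def canonical_correlations(x, y):
    """Canonical correlations of x (n x p) and y (n x q), assuming p <= q."""
    p = x.shape[1]
    s = np.cov(np.hstack([x, y]), rowvar=False)  # joint sample covariance S
    s11, s12 = s[:p, :p], s[:p, p:]
    s21, s22 = s[p:, :p], s[p:, p:]
    s11_is = inv_sqrt(s11)
    m = s11_is @ s12 @ np.linalg.inv(s22) @ s21 @ s11_is
    m = (m + m.T) / 2.0                      # remove round-off asymmetry
    rho_sq, e = np.linalg.eigh(m)            # eigenvalues in ascending order
    order = np.argsort(rho_sq)[::-1]         # largest rho^2 first
    rho = np.sqrt(np.clip(rho_sq[order], 0.0, 1.0))
    a = s11_is @ e[:, order]                 # column i is the weight vector a_i
    return rho, a

# Illustrative data: two columns of y are noisy copies of the columns of x.
rng = np.random.default_rng(0)
x = rng.normal(size=(500, 2))
y = np.column_stack([x[:, 0], x[:, 1], rng.normal(size=500)])
y = y + 0.1 * rng.normal(size=(500, 3))
rho, a = canonical_correlations(x, y)
```

Canonical correlations do not change when each stream is rescaled, so skipping the normalization step only affects the scale of the weight vectors in `a`, not `rho`.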



Principle of CCA

We transform the sample covariance of any two data streams and explain the statistics that need to be kept as follows.
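The transformed expression is not present in this transcript. A plausible reading, offered here as an assumption, is the standard expansion of the sample covariance of streams $X_i$ and $Y_k$ over $n$ observed values (the symbols $n$, $s_{ik}$, $\bar{x}_i$, and $\bar{y}_k$ are introduced only for this sketch):

```latex
s_{ik} = \frac{1}{n-1} \sum_{j=1}^{n} (x_{ji} - \bar{x}_i)(y_{jk} - \bar{y}_k)
       = \frac{1}{n-1} \left[ \sum_{j=1}^{n} x_{ji} y_{jk}
         - \frac{1}{n} \Big( \sum_{j=1}^{n} x_{ji} \Big) \Big( \sum_{j=1}^{n} y_{jk} \Big) \right]
```

Under this reading, the covariance depends on the raw values only through the count $n$, the sums $\sum_j x_{ji}$ and $\sum_j y_{jk}$, and the sum of products $\sum_j x_{ji} y_{jk}$. All of these are additive, so they can be kept per base window and summed over any run of base windows.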


MGDS algorithms


Algorithm 1: Generate base statistics
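The algorithm listing is not present in this transcript. The following is a minimal sketch of what generating base statistics could look like, not the authors' Algorithm 1. It assumes the base statistics of one base window are the count, the per-stream sums, and the sums of pairwise products; the function name `base_statistics` and the dictionary layout are hypothetical.

```python
def base_statistics(window):
    """Summarize one base window (a list of rows, one value per stream).

    Returns the count, the per-stream sums, and the matrix of sums of
    pairwise products. These are assumed stand-ins for the base statistics.
    """
    n_streams = len(window[0])
    sums = [0.0] * n_streams
    prods = [[0.0] * n_streams for _ in range(n_streams)]
    for row in window:
        for i, xi in enumerate(row):
            sums[i] += xi
            for k, xk in enumerate(row):
                prods[i][k] += xi * xk
    return {"n": len(window), "sum": sums, "prod": prods}

# Illustrative base window: 3 time steps of 2 streams.
window = [[1.0, 2.0], [2.0, 4.0], [3.0, 7.0]]
stats = base_statistics(window)
```

Once a base window is summarized this way, its raw values are no longer needed, so storage per window depends on the number of streams rather than on the window size.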


Algorithm 2: Analysis algorithm
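The listing for this algorithm is also missing from the transcript. As a hedged sketch of the analysis step, and not the authors' Algorithm 2, per-window statistics can be added up over a correlation query window and turned into a correlation matrix without revisiting raw values. The dictionary layout (`n`, `sum`, `prod`) and the function names are assumptions made for this example.

```python
import math

def merge(stats_list):
    """Add up the base statistics of several base windows."""
    d = len(stats_list[0]["sum"])
    total = {"n": 0, "sum": [0.0] * d,
             "prod": [[0.0] * d for _ in range(d)]}
    for s in stats_list:
        total["n"] += s["n"]
        for i in range(d):
            total["sum"][i] += s["sum"][i]
            for k in range(d):
                total["prod"][i][k] += s["prod"][i][k]
    return total

def correlation_matrix(total):
    """Correlation matrix computed from merged statistics only."""
    n, d = total["n"], len(total["sum"])
    cov = [[(total["prod"][i][k] - total["sum"][i] * total["sum"][k] / n)
            / (n - 1) for k in range(d)] for i in range(d)]
    return [[cov[i][k] / math.sqrt(cov[i][i] * cov[k][k])
             for k in range(d)] for i in range(d)]

# Statistics of two base windows over 2 streams.
# Window 1 rows: [1, 2], [2, 4]; window 2 rows: [3, 7], [4, 8].
w1 = {"n": 2, "sum": [3.0, 6.0], "prod": [[5.0, 10.0], [10.0, 20.0]]}
w2 = {"n": 2, "sum": [7.0, 15.0], "prod": [[25.0, 53.0], [53.0, 113.0]]}
corr = correlation_matrix(merge([w1, w2]))
```

The blocks of the merged covariance matrix that belong to group X and group Y would then play the roles of $S_{11}$, $S_{12}$, $S_{21}$, and $S_{22}$ in Theorem 1.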


Experiment (1/6)

Computer platform

Intel(R) Core(TM)2 Quad CPU Q8400 / 2.66GHz / 3G / 250G

OS: Windows XP SP3

MATLAB 2007b is used to run the programs and generate the synthetic data sets


Experiment (2/6)

Data set

Linear data set with noise

The values of every stream are taken from linearly increasing data on the interval [0, 50000], with added random values generated by $N(0, 3^2)$.

We expect a high correlation.

Gauss data set

The values of the streams in groups X and Y follow $N(50, 15^2)$ and $N(100, 25^2)$, respectively.

We expect a low correlation.

Real stock data set

Data for 15 stocks from Shenzhen Securities and Shanghai Securities, from Jan. 2005 to Dec. 2010.

We expect a very high correlation.
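The two synthetic data sets can be sketched as follows. The original experiments used Matlab; this Python version is only an illustration. The stream length, the seed, the independence of the Gaussian draws, and the function names `linear_stream` and `gauss_stream` are assumptions, since the slides do not give them.

```python
import random

def linear_stream(n, rng, high=50000.0, sigma=3.0):
    """n values rising linearly over [0, high] plus N(0, sigma^2) noise."""
    step = high / (n - 1)
    return [j * step + rng.gauss(0.0, sigma) for j in range(n)]

def gauss_stream(n, rng, mu, sigma):
    """n independent draws from N(mu, sigma^2)."""
    return [rng.gauss(mu, sigma) for _ in range(n)]

rng = random.Random(42)   # fixed seed, chosen arbitrarily
n = 1000                  # stream length, an assumption

# Linear data set with noise: streams share the same rising trend.
linear_x = linear_stream(n, rng)
linear_y = linear_stream(n, rng)

# Gauss data set: group X from N(50, 15^2), group Y from N(100, 25^2).
gauss_x = gauss_stream(n, rng, 50.0, 15.0)
gauss_y = gauss_stream(n, rng, 100.0, 25.0)
```

The linear streams share a trend that dwarfs the noise, which is why a high correlation is expected, while independently drawn Gaussian streams have no shared structure.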

Experiment (3/6)

The relationship between the number of streams and the time used per analysis

We use the Gauss data set in this experiment to compare the time that MGDS and the naive algorithm need for each correlation analysis as the number of streams (p+q) varies.

The size of the base window is 500 values, the number of correlation query windows is 30, and the numbers of streams are {p=40,60,80,100,120; q=60,90,120,150,180}.

The naive algorithm calculates the high-level statistics from the raw values instead of from the base statistics.

Experiment (4/6)


Experiment (5/6)

The relationship between the size of the base window and the correlation coefficient

We use the Gauss, linear, and real stock data sets to evaluate the effectiveness of MGDS as the size of the base window changes.

The number of streams is p=q=15, the number of correlation query windows is 5, and the base window size varies over {W=50,100,150,200,250}.


Experiment (6/6)


Conclusions

This paper proposes the MGDS algorithm, which is based on base windows.

Unlike other algorithms, MGDS does not keep all stream values: it compresses the original values into statistics and bases the correlation analysis only on these statistics, which greatly reduces space and time complexity.

Unlike other algorithms, the correlation analysis range of MGDS is not limited to a single window but can change flexibly depending on requirements, and the results of the MGDS algorithm are accurate.


Thank You

For Your Listening

Q & A
