
Temporal data mining:

algorithms, theory and applications

(TDM 2005)

Proceedings of a Workshop held in Conjunction with

2005 IEEE International Conference on Data Mining, Houston, USA, November 27, 2005

Edited by

Sheng Ma, IBM Research, USA

Tao Li, Florida International University, USA

Charles Perng, IBM Research, USA

ISBN 0-9738918-3-1


The papers appearing in this book reflect the authors’ opinions and are published in the

interests of timely dissemination based on review by the program committee or volume

editors. Their inclusion in this publication does not necessarily constitute endorsement by

the editors.

©2005 by the authors and editors of this book.

No part of this work can be reproduced without permission except as indicated by the

“Fair Use” clause of the copyright law. Passages, images, or ideas taken from this work

must be properly credited in any written or published materials.

ISBN 0-9738918-3-1

Printed by Saint Mary’s University, Canada.

Published by the Department of Mathematics and Computing Science

Technical Report Number: 2005-07, November 2005. ISBN 0-9738918-3-1


Table of Contents

Workshop Committee

Foreword

1. Predicting Protein Folding Structures by Means of a New Classification Approach (Huy Pham, Evangelos Triantaphyllou) .......... 9

2. CoDOTS: Secure Outlier Detection in Distributed Time Series (Josenildo Costa da Silva, Matthias Klusch) .......... 18

3. Grouping Multivariate Time Series: A Case Study (Tamraparni Dasu, Deborah F. Swayne, David Poole) .......... 25

4. Topographical Proximity: Exploiting Domain Knowledge for Sequential Data Mining (Ann Devitt, Joseph Duffin) .......... 33

5. Mining Spatio-Temporal Association Rules, Sources, Sinks, Stationary Regions and Thoroughfares in Object Mobility Databases (Florian Verhein, Sanjay Chawla) .......... 41

6. Tracking the Lyapunov Exponent in Data Streams (Raphael Ladysz, Daniel Barbara) .......... 53

7. Workflow Process Models: Discovering Decision Point Locations by Analyzing Data Dependencies (Sharmila Subramaniam, Vana Kalogeraki, Dimitrios Gunopulos, Fabio Casati, Umeshwar Dayal, Mehmet Sayal, Malu Castellanos) .......... 61

8. Computing Information Gain in Data Streams (Alec Pawling, Nitesh Chawla, Amitabh Chaudhary) .......... 72

9. Stream Mining for Network Management (Kenichi Yoshida, Satoshi Katsuno, Shigehiro Ano, Katsuyuki Yamazaki, Masato Tsuru) .......... 82

10. Web Usage Mining: Extracting Unexpected Periods from Web Logs (Florent Masseglia, Pascal Poncelet, Maguelonne Teisseire, Alice Marascu) .......... 89

11. A Dissimilarity Measure for Comparing Subsets of Data: Application to Multivariate Time Series (Matthew Otey, Srinivasan Parthasarathy) .......... 101

12. Temporal Data Mining Based on Temporal Abstractions (Robert Moskovitch, Yuval Shahar) .......... 113

13. Incremental Maintenance of Wavelet Synopses for Data Streams (Ken-Hao Liu, Wei-Guang Teng, Ming-Syan Chen) .......... 116

14. Finding Temporal Association Rules between Frequent Patterns in Multivariate Time Series (Giridhar Tatavarty, Raj Bhatnagar) .......... 127

15. An Empirical Study on Multistep-ahead Time Series Prediction (Haibin Cheng, Pang-Ning Tan, Jing Gao, Jerry Scripps) .......... 137

16. A PCA-based Kernel for Kernel PCA on Multivariate Time Series (Kiyoung Yang, Cyrus Shahabi) .......... 149

17. Fast Similarity Search of Time Series Data Using the Nyström Method (Akira Hayashi, Katsutoshi Nishizaki, Nobuo Suematsu) .......... 157

18. Identifying Temporal Patterns and Key Players in Document Collections (Benyah Shaparenko, Rich Caruana, Johannes Gehrke, Thorsten Joachims) .......... 165


Workshop Committee

Workshop Co-Chairs

Sheng Ma, IBM T.J. Watson Research, [email protected]

Tao Li, Florida International University, [email protected]

Charles Perng, IBM T.J. Watson Research, [email protected]

Workshop Program Committee

Name Affiliation E-mail

Inderjit Dhillon University of Texas at Austin [email protected]

Carlotta Domeniconi George Mason University [email protected]

Christos Faloutsos Carnegie Mellon University [email protected]

Johannes Gehrke Cornell University [email protected]

Oscar Kipersztok Boeing Research [email protected]

Wenke Lee Georgia Institute of Technology [email protected]

Feng Liang Duke University [email protected]

Bing Liu University of Illinois at Chicago [email protected]

Mitsunori Ogihara University of Rochester [email protected]

Srinivasan Parthasarathy Ohio State University [email protected]

Dennis Shasha New York University [email protected]

Hui Xiong University of Minnesota [email protected]

Tong Sun Xerox Research [email protected]

Philip S. Yu IBM Research [email protected]

Mohammed Zaki Rensselaer Polytechnic Institute [email protected]

Shenghuo Zhu NEC Laboratories America, Inc. [email protected]

Workshop Website: http://www.cs.fiu.edu/~taoli/workshop/TDM2005/


Foreword

The 2005 Temporal Data Mining: algorithms, theory and applications workshop (TDM

2005) is the second workshop on this theme held annually with the ICDM Conference.

Through the workshop, we expect to bring together researchers from both industry and

academia with diverse backgrounds: data mining, machine learning, databases, statistical

analysis, and application knowledge to foster interactions, to propose new ideas, to

identify promising technologies, to create a forum for discussing recent advances, to

better understand the practical challenges in applications, and to inspire new research

directions.

Many real-world applications deal with huge amounts of temporal data. Examples

include alarms/events and performance measurements generated by distributed computer

systems and by telecommunication networks, web server logs, online transaction logs,

financial data, workflow process logs, and sensor data collected from sensor networks.

Conventionally, temporal data is classified into either categorical event streams or

numerical time series and both types have been intensively studied in data mining and

statistics. However, several previously less emphasized aspects of temporal data have

proven their importance in emerging applications and posed several challenges calling for

more research.

The major topics of the workshop include but are not limited to:

Temporal data benchmarking

Temporal pattern discovery

Temporal data clustering

Anomaly and change detection of streaming data

Prediction for temporal data

Temporal data characterization and analysis

Statistical analysis of temporal data

Accommodating domain knowledge in the temporal mining process

Complexity, efficiency and scalability of temporal data mining algorithms

Content-based search and retrieval for temporal data

Process mining

Case Studies and Applications of Temporal Data Mining

o Adaptive workflow management

o Bioinformatics

o Information navigation


o Program behavior analysis

o Security management

o System management

o Web services, etc.

This year, we received over 30 submissions and the program committee finally selected

18 papers to include in the workshop program (about 50% acceptance rate). Most

submissions were reviewed and discussed by two reviewers and workshop co-chairs.

We are very indebted to all program committee members who helped us organize the

workshop and reviewed the papers very carefully. We would also like to thank all the

authors who submitted their papers to the workshop; they provided us with an excellent

workshop program. More information about the workshop can be found at:

http://www.cs.fiu.edu/~taoli/workshop/TDM2005/ .

November 2005

Workshop Co-chairs

Sheng Ma

Tao Li

Chang-shing Perng


PREDICTING PROTEIN FOLDING STRUCTURES BY MEANS OF A NEW CLASSIFICATION APPROACH

By H.N.A. Pham and E. Triantaphyllou
Department of Computer Science, 164E Coates Hall, Louisiana State University, Baton Rouge, LA 70803, U.S.A.
Email: [email protected] and [email protected]

Abstract

The structure prediction problem for proteins plays an important role in the study of protein folding. It is a notoriously hard problem, and achieving good prediction performance with new methods would have an impact both in the computational arena and in the bioinformatics field. This paper proposes a novel classification approach that uses a binary expansion method based on the density concept for homogenous clauses to predict protein folding structures. The successes of this approach are demonstrated on several protein data sets whose structure is partially known.

Keywords: protein folding, homogenous clause, binary expansion.

1. Introduction

Proteins naturally fold into complex 3D globules from their amino acid sequence. There are 20 different types of amino acids, each labeled by a single-letter code (A, C, G, T, ...). With this representation, a protein can be thought of as a sequence of such letters. The protein folding process poses at least two distinct problems [6].

- Structure Prediction Problem: determine the 3D structure of a protein from its amino acid sequence.

- Pathway Prediction Problem: determine the time-ordered sequence of folding events from a given protein amino acid sequence and its 3D structure.

Both problems have received attention from many researchers. The ability to predict the structural class of a protein, however, can greatly enhance structure prediction methods. The structure prediction problem, or protein folding problem, can offer significant clues about the function of a protein which cannot be obtained quickly or easily via experimental methods. Once a protein's structure is identified, it can be applied to identifying important players in human disease, to effectively designing new proteins, and to other applications.

To find the 3D structure, a protein is classified into one of four structural classes introduced by Levitt and Chothia [10] according to its secondary structure composition: all-α, all-β, α/β, and α+β. Hence this problem can be stated as a four-class classification problem. Once the structural class of a protein is known, it can be used to reduce the search space of the structure prediction problem: most of the structural alternatives can be eliminated, and the structure prediction task becomes easier and faster.

There have been many theoretical and practical developments on this problem in the last ten years. Many studies have resulted in classification and prediction systems, some highly accurate and some less so. Chou [1] assigned a protein into one of the four structural classes by using the Amino Acid Composition (AAC) of the protein and the Mahalanobis distance. Wang et al. [2] tried to improve Chou's work using the same data set, without success. Ding and Dubchak [3] used Neural Networks (NNs) and Support Vector Machines (SVMs) to classify proteins into one of 27 fold classes, which are subclasses of the


structural classes. Tan and coworkers [4] also worked on the fold classification problem (for 27 fold classes) using a new ensemble learning method. More recently, Zerrin Isik et al. [5] used SVMs with the Amino Acid Composition (AAC) of the protein as the basis for classification.

A growing belief is that the root of such limited accuracy is the overfitting and overgeneralization behavior of these systems. Roughly speaking, overfitting means that the extracted model describes the behavior of the known training data very well but does poorly on new data points. Overgeneralization occurs when the system uses the available data and then attempts to analyze vast amounts of data that it has not yet seen. Both problems may cause poor performance. This is a situation studied in statistics and, to some extent, with some of the data mining methods such as decision trees, NNs, and SVMs.

This paper presents an approach for controlling overfitting and overgeneralization, two key properties of learned classifiers. By doing so, it is hoped that the classification / prediction accuracy of the extracted system will be very high, or at least as high as can be achieved with the available training data. In particular, the approach uses the density concept of a homogenous clause, described in Section 2.1, and a binary expansion approach, described in Section 3, to classify the structure of proteins. In Section 4, the successes of this approach are demonstrated and assessed on several protein data sets whose structure is partially known, taken from Ding and Dubchak [3] and Zerrin Isik et al. [5]. All classification assessments in the paper use the average accuracy introduced by Rost & Sander, 1993 [12] and Baldi et al., 2000 [11].

2. Preliminaries

2.1 Multi-class prediction method

Most classification methods dealing with two-class problems are accurate and efficient, for example SVMs or NNs. When dealing with more classes, however, their accuracy and efficiency usually decrease. This section presents a method, One-vs-Others, that uses two-class classification methods as the basic building block for a larger number of classes. This is a simple and effective method introduced by Dubchak et al., 1999 [8] and Brown et al., 2000 [9].

In this process, suppose there are K classes in the problem. The K classes are first partitioned into a two-class problem: one class consists of the proteins in one "true" class, and the "others" class includes all other classes. A two-class classification method is then trained on this two-class problem. The process then partitions the K classes into another two-class problem, in which the "true" class is another original class and the "others" class is the rest, and another two-class classifier is trained. This process is repeated for each of the K classes, leading to K trained two-way classifiers.

In the testing process, the system presents each testing query to each of the K two-way classifiers and determines the maximum of the K scores returned by the classifiers. The query is then assigned to the "true" class of the classifier that produced the maximum score.
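As an illustration, the following minimal Python sketch shows the One-vs-Others scheme; scikit-learn's LogisticRegression stands in for whatever two-class learner is preferred (the paper itself uses BEA, and earlier work used SVMs or NNs), and all class and variable names here are ours, not the authors'.

import numpy as np
from sklearn.linear_model import LogisticRegression

class OneVsOthers:
    # One two-way classifier per class: the "true" class versus all others.
    def __init__(self, n_classes):
        self.n_classes = n_classes
        self.models = []

    def fit(self, X, y):
        self.models = []
        for k in range(self.n_classes):
            binary_y = (y == k).astype(int)          # "true" class vs. "others"
            self.models.append(LogisticRegression(max_iter=1000).fit(X, binary_y))
        return self

    def predict(self, X):
        # Score every query with all K classifiers and assign the class whose
        # classifier returns the maximum score.
        scores = np.column_stack([m.decision_function(X) for m in self.models])
        return scores.argmax(axis=1)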

2.2 Homogenous clause (HC) and its density

Homogeneous is an adjective that has several

meanings. In biology homogeneous has a

meaning similar to its meaning in mathematics.

In physical chemistry, homogeneous describes

a single-phase system as opposed to

heterogeneous where more than one

thermodynamically distinct phase co-exists.

Homogenous (without the second e) has a

similar meaning of being the same throughout,

and is perhaps more common in everyday

speech.

In this paper, homogenous has a meaning similar to its use in physical chemistry: it describes a uniformity in which distinct phases co-exist. A homogenous clause covers a set of

examples of a given class (i.e., positive or

negative) and unclassified examples uniformly.

That is, within the clause there are no

subdivisions with unequal concentrations of

classified (i.e., either positive or negative) and

unclassified examples.

For instance, Figure 1 depicts a situation

defined on two continuous variables X and Y.

In the same figure clause A is a non-

homogenous clause while clause B is a more

homogenous one. Please note that in these two

clauses only the classified data are shown as

small circles. The unclassified data are the rest

of the points of the X-Y plane.

Clause A, however, can be replaced by two more homogenous clauses, denoted A1 and A2. The areas covered by the two new Clauses A1 and A2 are then more homogenous than the area covered by the original Clause A.

The above example suggests a criterion for a new classification approach.

When a classification algorithm that infers a set of

classification rules from training examples is

applied, these rules may or may not be affected by

homogenous clauses and their density. In turn, this

may affect the accuracy in correctly classifying

new data points. For instance, the clause labeled as

“Clause A” in Figure 1 is not as homogenous as

Clause B in the same figure. Thus, it is possible

that unseen examples covered by clause A are

erroneously assumed to be in the same class as the

“solid” examples covered by the same clause. In

particular, this is most likely to occur in regions of

Clause A that are not populated by “solid” points.

Such a region, for instance, exists in the upper left

corner of Clause A (see also Figure 1). Another

similar region is the lower part of the same clause.

On the other hand, Clause B is more homogenous

than Clause A. Thus, it is more likely that the

unclassified examples covered by Clause B are

more accurately assumed to be in the same class as

the “solid” examples covered by the same clause.

The above simple observations lead one to surmise

that the accuracy of the classification rules can be

increased if the derived clauses are, somehow,

more compact and homogenous.

Figure 1: B, A1, A2 are homogenous clauses while A is a non-homogenous clause. A can be replaced by two homogenous clauses A1 and A2


Intuitively, another factor also affects the accuracy of the classification rules: the density of a homogenous clause. For example, the unclassified examples covered by Clause B in Figure 1 are more safely assumed to be in the same class as the "solid" examples covered by B than those covered by Clause A1 or A2. In particular, Clause B may be expanded wider than Clauses A1 and A2, since B's density is higher than that of the other clauses. This section ends with a simple definition: the density of a homogenous clause is the number of examples of a given class per unit volume. This factor determines how much the homogenous clause can be expanded.

2.3 Accuracy measure

In two-class problems, assessing the accuracy involves calculating true positive rates and false positive rates. In multi-class problems, particularly those converted through the One-vs-Others method, this assessment has to be suitably extended to more than two classes. A simple standard assessment, Q, was introduced by Rost & Sander, 1993 [12], and Baldi et al., 2000 [11]. Suppose there are N = n1 + n2 + ... + nk test proteins, where ni is the number of examples in class i. Let C = c1 + c2 + ... + ck be the total number of proteins that are correctly recognized, where ci is the number of examples correctly recognized in class i. Therefore the accuracy for class i is Qi = ci/ni and the overall accuracy is Q = C/N. An individual class contributes to the overall accuracy in proportion to the number of proteins in its class. Hence each Qi relates to the overall Q by a weight wi = ni/N. The overall accuracy is:

$$Q = \sum_{i=1}^{k} w_i Q_i$$

If a protein is sequentially tested for all four classes and one of them is correct, then it contributes c = 1/4. Therefore, in general, ci can be a real number.
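For concreteness, the weighted accuracy Q can be computed with a few lines of Python (a sketch; the function and variable names are ours):

import numpy as np

def overall_accuracy(n_per_class, c_per_class):
    # Q = sum_i w_i * Q_i, with w_i = n_i / N and Q_i = c_i / n_i (equals C / N).
    n = np.asarray(n_per_class, dtype=float)   # n_1, ..., n_k test proteins per class
    c = np.asarray(c_per_class, dtype=float)   # c_1, ..., c_k correctly recognized
    q_i = c / n                                # per-class accuracy Q_i
    w_i = n / n.sum()                          # per-class weight w_i
    return float((w_i * q_i).sum())

# Example with four classes, as in the protein structural-class problem:
print(overall_accuracy([100, 80, 60, 40], [90, 60, 42, 30]))   # 0.792...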

3 Binary Expansion Approach (BEA)

Input: positive and negative examples
Output: a suitable classification

Step 1: Find positive and negative clauses by using k-means-based clustering with the Euclidean distance.
Step 2: Find positive and negative homogenous clauses from the positive and negative clauses, respectively.
Step 3: Sort the positive and negative homogenous clauses by density.
Step 4: FOR each homogenous clause C DO
    + Find C's density, say D
    + Expand C by using D

Figure 2: The Binary Expansion Algorithm

This section outlines the binary expansion approach to predicting the folding structure of a protein using the idea of "expanding homogenous clauses". Essentially, this approach is a two-class classification method, and the protein folding problem then uses the approach through the One-vs-Others method. Suppose each protein in the data sets is represented as a vector in n dimensions; the Euclidean distance is used to compute distances between proteins. In the training phase, the intuition behind this approach is to find positive and negative homogenous clauses, and then to expand each homogenous clause, treated as a sphere, until the area of the clause exceeds a threshold based on that clause's density. The testing phase then uses the expanded positive and negative homogenous clauses to test the structures of new proteins. A detailed description of this approach is given in Figure 2.

At step 1, k-means training starts by generating the k clause centers randomly and proceeds by fitting the data points into those clauses using the Euclidean distance. This process is repeated until all points are assigned to clauses. If clauses of a given class are close together, they can be joined into a single clause. Remaining points that do not belong to any clause become new single-point clauses.

The k-means-based clustering method is also used to find positive and negative homogenous clauses from the positive and negative clauses, respectively. There are only two differences: while fitting the data points into clauses, the process stops when it hits the border of a positive or negative homogenous clause, and the distance used in the process is the minimum distance between any two points of a given class in the training set. The sorting of the positive and negative homogenous clauses determines the order in which the homogenous clauses are expanded.

Step 4 is the main part of this algorithm. Suppose the homogenous clauses are sorted by density. The expanding process starts with the homogenous clause that has the highest density and proceeds in decreasing order of density. For the current homogenous clause, considered as a sphere, a new homogenous clause is expanded by

$R = R_1 + \frac{D}{2}\,(R_2 - R_1)$

where R is the expanded HC's radius, R1 is the HC's radius, and R2 is the envelope's radius. The envelope's radius is double the radius of the current homogenous clause. This formula expresses that the

density of a homogenous clause decides how much that clause is expanded. The expanding process stops when a point of a different class occurs in the expanded region, when the area of the expanded region is greater than a multiple of D, or when the current homogenous clause's radius is greater than the envelope's radius. The overall approach in 2D is presented in Figure 3.

Figure 3: The overall approach for positive examples in 2D (positive clauses, homogenous clauses, and extended HCs after expansion)
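The following Python sketch illustrates one reading of the BEA training loop of Figure 2: k-means clauses per class, sphere-shaped homogenous clauses with a density (points per unit volume of the bounding sphere, an assumption), and density-driven expansion of each radius toward a doubled envelope. All names, the density definition, and the stopping checks are our assumptions rather than the authors' exact implementation.

import numpy as np
from sklearn.cluster import KMeans

def find_clauses(points, n_clauses):
    # Steps 1-2: k-means clauses for one class; each clause is a sphere
    # (center, radius) with density = #points / radius**dim (simplified volume).
    km = KMeans(n_clusters=n_clauses, n_init=10).fit(points)
    clauses = []
    for k in range(n_clauses):
        members = points[km.labels_ == k]
        if len(members) == 0:
            continue
        center = members.mean(axis=0)
        radius = max(np.linalg.norm(members - center, axis=1).max(), 1e-9)
        density = len(members) / radius ** points.shape[1]
        clauses.append({"center": center, "r": radius, "density": density})
    return clauses

def expand_clauses(clauses, other_class_points):
    # Steps 3-4: sort clauses by density and expand each radius toward a doubled
    # envelope, never past the nearest point of the opposite class.
    for c in sorted(clauses, key=lambda c: -c["density"]):
        r1, r2 = c["r"], 2.0 * c["r"]                   # envelope radius = 2 * R1
        r_new = r1 + 0.5 * c["density"] * (r2 - r1)     # density-driven expansion
        nearest_other = np.linalg.norm(other_class_points - c["center"], axis=1).min()
        c["r"] = min(r_new, r2, nearest_other)
    return clauses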


4 Results

This section presents test bed applications for

our method with the independent testing

method, and assessments based on the standard

accuracy measure introduced in Section

2.3.We firstly have applied the approach to

data sets studied by Chih. C.Chang and Chih.J.Lin at

www.csie.ntu.edu.tw/~cjlin/papers/guide/data/. This

data set consists of three small data sets whose

features are described in Table 1. Another data set

from Chih. C.Chang and Chih.J.Lin at

www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary has

assessed as in Table 3 and 4. Results obtained

from C.J.Lin’s experiments [13] and this approach

are in Table 1 and 2 respectively.

Training set               Testing set                           #Atts*   C.-J. Lin's SVMs
Train_1 (3089 examples)    Test_1 (4000 examples)                4        96.9%
Train_2 (391 examples)     Train_2 (391 examples, cross-valid.)  20       85.2%
Train_3 (1243 examples)    Test_3 (41 examples)                  21       87.8%

Table 1: results of C.-J. Lin's SVMs. Source of the data set: www.csie.ntu.edu.tw/~cjlin/papers/guide/data/. *Atts: attributes

Training set   Testing set   #Fail Positive               #Fail Negative               Q
Train_1        Test_1        9 (of 2000 positive exs.)    22 (of 2000 negative exs.)   99.25%
Train_2        Train_2       0                            0                            100%
Train_3        Test_3        0 (of 41 positive exs.)      0 (of 0 negative exs.)       100%

Table 2: results of BEA. Source of the data set: www.csie.ntu.edu.tw/~cjlin/papers/guide/data/

The comparison in Figure 4 shows that BEA provides an improvement of around 15.5% in classification accuracy over the SVM method. We can explain this improvement through the nature of the SVM method. Since the SVM method uses hyperplanes to classify training points, it creates a wide undecided region around seen points, and this leads to overgeneralization. In contrast, BEA starts with the homogeneity of the points and the density of the homogenous clauses for expanding seen regions. It may create expanded regions that satisfy both the fitting and the generalizing properties. Thus, this approach provides higher classification accuracy than the other methods. Tables 3 and 4 are other test beds for the approach. These tables show that BEA obtains better classification rates when using more training data, which is as expected.

Figure 4: BEA's accuracy and C.-J. Lin's SVMs on Train_1/Test_1, Train_2 with cross-validation, and Train_3/Test_3 (accuracy axis from 75% to 100%)


Data          #Atts   #Exps   Q           Data          #Exps   Q
Train: w1a    300     2477                Train: w4a    7366
Test:  w2a    300     3470    85.97%      Test:  w1a    2477    85.79%
       w3a    300     4912    85.40%             w2a    3470    86.57%
       w4a    300     7366    85.08%             w3a    4912    86.16%
       w5a    300     9888    84.64%             w5a    9888    85.41%
       w6a    300     17188   84.18%             w6a    17188   84.83%

Table 3: results of BEA. Source of the data set: www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary. *Exps: examples

Data          #Atts   #Exps   Q           Data          #Exps   Q
Train: a3a    122     3185                Train: a7a    16100
Test:  a4a    122     4781    90.17%      Test:  a3a    3185    94.98%
       a5a    122     6414    86.47%             a4a    4781    94.92%
       a6a    122     11220   82.17%             a5a    6414    94.92%
       a7a    122     16100   79.99%             a6a    11220   96.95%

Table 4: results of BEA. Source of the data set: www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary

Data types          Symbol   #Atts   #Training exps   #Testing exps
A.A. Composition    C        21      605              385
Secondary struc.    S        22      605              385
Polarity            P        22      605              385
Polarizability      Po       22      605              385
Hydrophobicity      H        22      605              385
Volume              V        22      605              385

Table 5: Six parameter datasets extracted from protein sequences. Source of the data set: http://www.nersc.gov/~cding/protein/

For the protein folding problem, the data set we used for training and testing was selected from the database built by Ding and Dubchak [3]. This database has seven or more proteins in each fold and presents all major structural classes: all-α, all-β, α/β, and α+β, with the 27 most populated folds [7]. Table 5 describes this database.

Results obtained from Ding and Dubchak's method using SVMs and NNs, and from Zerrin's method using SVMs-AAC and SVMs-TrioAAC, on the same dataset are given in Table 6. Please note that Zerrin's paper only assessed the Amino Acid Composition data type.

Data types          Q1      Q2      Q3       Q4
Composition         44.9%   20.5%   71.44%   66.66%
Secondary struc.    35.6    18.3
Hydrophobicity      36.5    14.2
Polarity            32.9    11.1
Volume              35.0    13.4
Polarizability      32.9    13.2

Table 6: Results of Ding and Dubchak's paper [3] and Zerrin's paper [5]

Q1: Accuracy of the SVMs Independent Test method in Ding's assessment
Q2: Accuracy of the Neural Networks Independent Test method in Ding's assessment
Q3: Accuracy of the SVMs-AAC method in Zerrin's assessment
Q4: Accuracy of the SVMs-TrioAAC method in Zerrin's assessment


Data types          all-α    all-β    α/β      α+β      Q5
A.A. Composition    87.27%   74.81%   71.43%   91.95%   81.37%
Secondary struc.    87.23    72.21    66.75    91.17    79.34
Hydrophobicity      86.75    74.55    71.17    91.69    81.04
Polarity            87.27    73.51    70.13    91.95    80.72
Volume              87.01    74.29    71.43    91.95    81.17
Polarizability      86.75    74.29    70.13    91.95    80.78

Table 7: results of BEA

The results obtained from BEA for the same dataset are given in Table 7. The comparison of the Q values in Figure 5 shows that for the Amino Acid Composition data type, BEA provides around a 10% improvement in classification accuracy over the SVMs-AAC method, a 43% improvement over Ding's SVM for the Secondary Structure data type, a 44% improvement for Hydrophobicity, a 48% improvement for Polarity, a 46% improvement for Volume, and a 47% improvement for Polarizability.

5 Conclusion

This paper has described a novel machine learning method for a notoriously hard problem, the structure prediction problem for proteins. The comparison of experiments shows that BEA provides a 10-48% improvement in classification accuracy. We have also obtained better classification rates using more training data, which is as expected.

References

[1]. Chou, K.C.: A novel approach to predicting

protein structural classes in a (20-1)-d amino

acid composition space. Proteins 21 (1995) 319–

344

[2]. Wang, Z.X., Yuan, Z.: How good is

prediction of protein structural class by the

component-coupled method. Proteins 38 (2000)

165–175

[3]. Ding, C.H., Dubchak, I.: Multi-class protein

fold recognition using support vector machines

and neural networks. Bioinformatics 17 (2001)

349–358

[4]. Tan, A.C., Gilbert, D., Deville, Y.: Multi-

class protein fold classification using a new

ensemble machine learning approach. Genome

Informatics 14 (2003) 206–217

Figure 5: BEA's accuracy compared with Ding's Neural Networks, Ding's SVMs, SVMs-TrioAAC, and SVMs-AAC on the six data types (C, S, P, Po, H, V)


[5]. Zerrin Isik et al: Protein Structural Class

Determination Using Support Vector Machines.

Lecture Notes in Computer Science-ISCIS

(2004), vol: 3280, pp. 82.

[6]. Jason T.L. Wang et al: Data Mining in

Bioinformatics (2005), Chapter 7, Predicting

Protein Folding Pathway, 127-141.

[7]. Hobohm, U., Scharf et al: Selection of a

representative set of structures from the

Brookhaven Protein Bank. Protein Sci., 1,

(1992), 409-417.

[8]. Dubchak et al: Recognition of a protein fold in the context of the Structural Classification of Proteins (SCOP) classification. Proteins 35 (1999), 401-407.

[9]. Brown et al: Knowledge-based Analysis of Microarray Gene Expression Data by Using Support Vector Machines. Proc. Natl Acad. Sci. USA, 97 (2000), 262-267.

[10]. Levitt, M., Chothia, C.: Structural patterns

in globular proteins. Nature 261 (1976) 552–558

[11]. Baldi,P. et al: Assessing the accuracy of

prediction algorithms for classification: an

overview. Bioinformatics, 16 (2000), 412-424.

[12]. Rost, B. and Sander, C.: Prediction of

protein secondary structure at better than 70%

accuracy. J.Mol. Bio., 232 (1993), 584-599.

[13]. C.-W. Hsu, C.-C. Chang, C.-J. Lin: A

practical guide to support vector classification

(July 2003).


CoDOTS: Confidential Detection of Outliers in Distributed Time Series

Josenildo Costa da Silva and Matthias Klusch

German Research Center for Artificial Intelligence, Deduction and Multiagent Systems
Stuhlsatzenhausweg 3, 66123 Saarbrücken, Germany
jcsilva, [email protected]

Abstract

Privacy-preserving data mining is an exciting field of research. However, there is still a lack of investigation of privacy-preserving algorithms for time series data. In this paper we present CoDOTS, which allows a group of sites in an open data mining environment to preserve the confidentiality of their local data while getting the benefits of collaborative mining. Additionally, we propose a framework that helps us to capture the notion of confidentiality in distributed time series analysis. An analysis of CoDOTS according to this framework is provided as well.

1 Introduction

Outlier detection aims at finding interesting deviations from an assumed model of normal behavior, and finds application in a variety of domains ranging from manufacturing control (anomaly detection), to intensive care monitoring (emergency state detection), to medical injury identification (anomaly detection), and science in general (infrequent patterns).

Methods for outlier detection in distributed data sources have to take additional constraints into account, such as network bandwidth and how data is shared due to privacy concerns. Privacy-preserving data mining is an increasingly vibrant research area. The problem is how to effectively preserve data privacy to a given extent and at the same time get the benefits of collaborating with other sites to achieve improved mining results.

As of today, no approach to privacy-preserving distributed outlier detection in time series exists. In this paper, we present a solution to this problem by means of a new algorithm, CoDOTS, whose basic idea is to use distributed neighborhood counting in multi-dimensional data.

This paper is organized as follows. Motivation, related work, and the problem definition are presented in sections 2, 3, and 4, respectively. The CoDOTS algorithm and our proposed confidentiality framework are presented in sections 5 and 6, respectively. We conclude in section 7.

2 Why privacy preserving time series analysis?

In many real-world applications, such as in the medical and military domains, certain data are considered sensitive and hence must be kept private to avoid any disclosure to unauthorized personnel. Medical patient data are to be processed, stored, and disseminated by the respective medical staff under strict legal regulations. This is to prevent unauthorized third parties, like insurance companies or employers, from, for example, accessing the raw data and drawing conclusions about the individual health conditions of applicants and employees, respectively. From processing time series data such as ECG, blood pressure, and weight evolution time series, an insurance company might conclude that a patient suffers from some heart disease and, as a consequence, refuse the respective insurance application without any further approval by the treating doctor or explanation to the applicant.

In economics, time series analysis is classically used to forecast the price evolution of business assets such as stocks and products. Any member of a temporarily formed business alliance will strive to preserve the privacy of its local time series data, depending on the level of service agreements and contracts, to ensure its future individual competitive business strength. For example, privacy-preserving distributed time series analysis could be used to forecast the price evolution of different products, or to analyze combined product component test data from different enterprises involved in its assembly. However, local strategic data is not likely to be completely shared among competitors.

On the other hand, we also have to hide the underpinning model of the time series analysis, such as the Fourier and wavelet coefficients that are used to represent time series. Knowledge about the selected model can be transferred more easily than the considered volumes of raw data, and hence can be used by an attacker in an attempt to reconstruct private local data and its origins.

To illustrate this discussion, let us present a more concrete example of distributed outlier detection with privacy issues. Suppose two or more clinics are interested in finding anomalous ECG signals. Working alone, each clinic is able to find a set of locally unusual signals. Now, by collaborating with the other clinics in the group, say clinic 1 discovers that one of its local outliers occurs very often at other sites; therefore, it can no longer be considered an outlier. On the other hand, clinic 1 confirms its suspicion that other local signals are very infrequent and keeps their status as outliers. Such anomalies may indicate interesting instances of a rare disease, or at least may point to a defective device used during patient examination. The point here is that if the values of the ECG signals are to be kept confidential, as any other medical data, then a privacy-preserving algorithm is needed to achieve the above-described scenario.

Another example shows how outlier detection may be used by small retailers. Suppose a group of retailers works together to detect outliers/anomalies in the demand for some product. By collaborating, they can distinguish local anomalies from global trends, i.e., patterns that are infrequent locally but occur very often at other sites. Hence, the retailers may prepare more efficient sales strategies, since outliers are discarded with a higher level of confidence. Needless to say, retailers do not want to disclose sales information to their competitors, which calls for a privacy-preserving algorithm.

3 Related Work

Work on privacy-preserving data mining follows three main approaches. Sanitization aims to modify the dataset such that sensitive patterns cannot be inferred; it was developed primarily for association rule mining (cf. [3, 15]). The second approach is data distortion, in which the true value of any individual record is modified while keeping "global" properties of the data (cf. [5, 2] among others). Finally, SMC-based approaches apply techniques from secure multi-party computation (SMC), which offers an assortment of basic tools allowing multiple parties to jointly compute a function on their inputs while learning nothing except the result of the function (cf. [13, 16]).

The problem of distributed outlier detection has been investigated in the fields of network intrusion detection and sensor networks. For instance, in [12] a density-based approach is presented, in which a density estimate is used to model the normality of the data, and outliers are defined as points with low regional density. Global outliers are detected by combining the local models and data sampling sets of the local sites.

In [6] a method for outlier detection in time series is proposed. This method is based on projection pursuit and searches for outliers in the feature space produced by the projection directions, which is shown to be more powerful than testing the multivariate series directly. Another interesting work on this problem drops the assumption of a correct model for the time series and proposes a method using the notion of deviants, which is based on the known problem of optimal histogramming (cf. [7]).

There is not a lot of work on privacy-preserving time series analysis. One exception is the work on collaborative forecasting recently presented in [4]. To the best of our knowledge, current methods for outlier detection were not primarily designed to handle distributed time series. Moreover, none of them addresses privacy issues.

4 Problem definition

Figure 1. Outliers at Peer 1 found in the local foetal ECG time series data.

An outlier is a subsequence of a time series that occurs with low frequency in the collection of reference (figs. 1 and 2). We represent a time series as one high-dimensional point, and an outlier as such a point with a low number of neighbors.

Definition 1 (Time series, outliers) Suppose a discrete time series is represented as a vector x = (x1, x2, ..., xd). Given a distance function $D : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ for arbitrary pairs (x, y) of data points in $S \subset \mathbb{R}^d$, we then define the neighborhood of x with radius $r \in \mathbb{R}$ as:

$$N_{[S,D]}(x, r) = \{\, y \in S \mid D(x, y) \leq r \,\} \qquad (1)$$

Given S, D, and a threshold $q \in \mathbb{N}$ for the maximum number of neighbors of any element x of S, we consider x an outlier of S if $|N_{[S,D]}(x, r)| < q$.
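Definition 1 translates directly into a brute-force neighbor count; the short Python sketch below flags points with fewer than q neighbors within radius r under the Euclidean distance (names are ours, and the quadratic scan is purely illustrative).

import numpy as np

def local_outliers(S, r, q):
    # S is an n x d array: each row is one (windowed) time series as a point in R^d.
    S = np.asarray(S, dtype=float)
    dists = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=-1)  # pairwise distances
    neighbor_counts = (dists <= r).sum(axis=1) - 1                  # exclude the point itself
    return np.where(neighbor_counts < q)[0]                         # indices of outliers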


Figure 2. Outliers oA, oB, oC in the original time series of peer 1 (zoomed and superposed); x(t) vs. time stamp.

Analogously, given p data sets $S_j \subset \mathbb{R}^d$, $1 \leq j \leq p$, a distance function D, and a threshold q, any element x of some local data set $S_j$ is considered a global outlier in $S = \bigcup_{j=1}^{p} S_j$ with respect to radius r if $|N_{[S,D]}(x, r)| < q$.

Definition 2 (Confidential detection of outliers) Let $P = \{P_j \mid 1 \leq j \leq p\}$ denote a group of peer mining agents, each of which owns a local time series data set $T_j \subset \mathbb{R}^d$. The problem of distributed outlier detection on time series with confidentiality is defined as follows. Given P, a distance function D, the collection of time series of the group $T = \bigcup_{j=1}^{p} T_j$, a constant radius r, and a maximum number q of neighbors, compute the set $O \in \mathcal{P}(T)$ such that:

1. $\forall o \in O, \; |N_{[T,D]}(o, r)| < q$, i.e. O is the set of subsequences of T whose number of neighbors is less than the given threshold q (correctness requirement);

2. communication cost, in terms of message size, is minimized (communication requirement);

3. privacy of local data is preserved to the maximum (confidentiality requirement).

How can this problem be solved? Transmitting all data to a central location violates both the communication and the confidentiality constraints. Alternatively, one could detect local outliers in the local data and then collaborate to decide whether the local outliers appear frequently in the local time series data of other peers. Intuitively, this approach would meet both the correctness and the communication requirements, but is it also secure? How can we find a data representation that reveals as little information as possible to any attacker while still enabling global outlier detection? How can we measure the amount of information leakage of an arbitrary data representation? Before we discuss these issues further in section 6, we first present our algorithm for distributed outlier detection in time series.

5 The CoDOTS Algorithm

5.1 Data Transformation

CoDOTS is based on a transformation which works as follows. Given a window size w and a base value v, every element $x_i$ of the data point $x = [x_1, x_2, \ldots, x_d]$ is replaced by its approximation $x'_i = \lfloor x_i / v \rfloor \, v$. Only elements at w-interval positions (0, w, 2w, ...) are used to build the transformed local time series x' (fig. 3).
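A small Python sketch of this transformation, under our reading of the formula as rounding each value down to a multiple of v and keeping only every w-th element (function and variable names are ours):

import numpy as np

def codots_transform(x, w, v):
    # x'_i = floor(x_i / v) * v, then keep only positions 0, w, 2w, ...
    x = np.asarray(x, dtype=float)
    discretized = np.floor(x / v) * v
    return discretized[::w]

# Example with w = 1 (keep every element, as in Figure 3) and v = 10:
print(codots_transform([3.2, 17.9, -24.5, 41.0], w=1, v=10))   # [  0.  10. -30.  40.]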

Figure 3. Outlier oA in the transformed time series of peer 1 using the transformation f(·) of CoDOTS with window size w = 1 and discretization degree v = 10 (the panel also shows the transformation error).

5.2 Detailed Description

The main idea relies on the fact that the set of local outliers includes the set of potential global outliers; hence the task is to select the true global outliers based on information about data neighborhoods gathered from the peer mining agents.

Each local peer mining agent performs the following steps of the CoDOTS algorithm. First, it negotiates the parameters used, that is, the distance function D, the grid window w, the amount v of discretization, the radius r, the dimension d (the common size of each local time series), and the maximum number q of neighbors allowed. After that, each agent computes the set of local outliers Oj, i.e. points that have fewer neighbors in Sj than q. We do not assume any particular method for finding outliers, provided that the outliers found are in accordance with our definition (few neighbors). One option would be to apply the algorithm in [10] and pick the patterns with low frequency.

The next step is to compute the set of transformed local outliers O'j by applying the transformation presented in the previous section to the original local data points Oj. Since local outliers may have neighbors residing in non-local data sets, each local peer sends to the Helper the set O'j of


Algorithm 1 Local Peer
Input: reference to helper H and to other peers Pi
Output: a set of local outliers Oj

 1: negotiate(D, w, v, r, m, q);
 2: Oj <- { x | x in Sj, |N_[Sj,D](x, r)| < q };
 3: O'j <- transform(Oj, w, v);
 4: C'j <- {};                                 // initialize the counting table
 5: for each o' in O'j do
 6:     C'j <- C'j U { <o', |N_[O'j,D](o', r)|> };
 7: end for
 8: send(H, C'j);
 9: receive(H, C');
10: for each <o', n> in C' do
11:     if n > q then
12:         Oj <- Oj \ { o | o' = transform(o) };
13:     end if
14: end for

Algorithm 2 Helper
Input: references to local peers Pj
Output: global neighbor counting C' for each Pj

 1: C' <- {};
 2: for j = 1 to p do
 3:     receive(Pj, C'j);
 4:     for each <o', m> in C'j do
 5:         if <o', n> in C' then              // if the element exists
 6:             <o', n> <- <o', n + m>;        // then update the sum
 7:         else
 8:             C' <- C' U { <o', m> };
 9:         end if
10:     end for
11: end for
12: for j = 1 to p do
13:     send(Pj, C');
14: end for

outliers of its transformed local time series, together with a neighbor count for each point, C'j (fig. 4).

After that, the local agent waits until it receives an updated counting table C'' from the helper, which is used to compute the final local outlier set by dropping the points which are found to have, globally, a number of neighbors greater than the threshold q (fig. 5).

The distinguished helper agent first receives all counting sets C'j from the local peers and then computes the global counting set C' by simply summing up the local neighbor counts. Finally, it computes individual updates C''j = C' \ C'j of the counting sets and returns them to the corresponding local peers Pj.
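For intuition, a minimal Python sketch of the Helper's counting-table merge (the summation step of Algorithm 2); the dictionary-based representation and names are our own simplification:

from collections import Counter

def helper_merge(counting_tables):
    # Each table maps a transformed outlier (a hashable tuple) to its local
    # neighbor count; the global count C' is the sum over all peers.
    global_counts = Counter()
    for table in counting_tables:            # one counting table per peer P_j
        global_counts.update(table)
    return dict(global_counts)

# Example: two peers report counts for their transformed outliers; the pattern
# shared by both peers ends up with a global count of 4.
peer1 = {(0.0, 10.0, -30.0): 1, (40.0, 40.0, 50.0): 2}
peer2 = {(0.0, 10.0, -30.0): 3}
print(helper_merge([peer1, peer2]))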

Figure 4. Collection of transformed outliers at the Helper (transformed outliers A, B, C from peer 1 and D from peer 2).

Figure 5. Example of two local outliers (transformed outlier A from peer 1 and transformed outlier D from peer 2) that are found to be neighbors.

5.3 Performance Analysis

At the local peer, the time complexity is linear in the size of the local data set and the size of the set of local outliers. This is mainly because of step 2 in Algorithm 1, which requires at least one pass through the local data set Sj. At the helper, the complexity is linear in the size of $C' = \bigcup_j C'_j$, since $|C'| \approx |\bigcup_j O'_j|$ and the main loop iterates through all elements of each $C'_j$, with $1 \leq j \leq p$.

CoDOTS needs two rounds, each with p messages to be exchanged between the local peers and the helper. Each message is of linear size $O(|O'_j|)$ (transformed points plus neighbor counts). Besides, for a given grid window w, it holds that $|O'_j| = |S_j| / w$, since our data transformation uses only elements at w-interval positions.

6 Confidentiality Analysis of CoDOTS

To carry out the analysis of the confidentiality of our algorithm we first have to introduce our confidentiality framework. In section 6.2 we show a security analysis of our algorithm according to this framework.

6.1 Confidentiality Framework

The task of keeping information private relates to that of keeping others from knowing this information for sure. In [2], the notion of confidentiality is formalized as the size of the interval in which the value of a certain variable ranges with probability c%. This definition was further extended in [1], taking into account the probability distribution of the original data. More recently, a confidentiality measure was proposed in [11] to indicate the privacy of a particular data set with respect to a model describing it.

The goal of any privacy measure is to formalize how close attackers can get to the original data set, or, put differently, what is the size of the interval in which the reconstructed data points lie with probability 1? We argue that a measure of privacy has to take into account the possibility of data reconstruction. As an illustration, consider a data set S and its transformed version S' = f(S), where f() is a randomization operator. It was shown in [8] that, under certain circumstances, the random noise can be filtered out, producing a very good reconstruction of S. Hence, our framework is based on the notion of reconstruction accuracy, as discussed in detail in the following.

Let S represent a set of data, the original data, and let S' represent some transformed version of S obtained through a transformation process f(·), i.e. S' = f(S). Let R represent a data set reconstructed from S' through an inverse transformation f⁻¹(·), i.e. R = f⁻¹(S'). We define the reconstruction accuracy as follows.

Definition 3 Let the probability distribution of the error between S and R be pξ. We define the reconstruction error as

$$\varepsilon_{rec}[f] = 2^{h(\xi)} \qquad (2)$$

where h(ξ) is the entropy of the error modeled by the probability function pξ, i.e.

$$h(\xi) = -\sum_{\varepsilon \in (\varepsilon_{min},\, \varepsilon_{max})} p_\xi(\varepsilon) \log_2(p_\xi(\varepsilon)) \qquad (3)$$

or $h(\xi) = -\int_{-\infty}^{+\infty} p_\xi(\varepsilon) \log_2(p_\xi(\varepsilon))\, d\varepsilon$ if pξ is a density function.
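As a sketch, the discrete case of Definition 3 can be estimated from an empirical error distribution; the following Python fragment (names ours) histograms the reconstruction errors and returns 2**h(xi):

import numpy as np

def reconstruction_error(original, reconstructed, bins=32):
    # eps_rec[f] = 2**h(xi), with h(xi) the entropy of the empirical error distribution.
    errors = np.asarray(original, float).ravel() - np.asarray(reconstructed, float).ravel()
    counts, _ = np.histogram(errors, bins=bins)
    p = counts[counts > 0] / counts.sum()     # empirical p_xi over error bins
    entropy = -(p * np.log2(p)).sum()         # h(xi) in bits
    return 2.0 ** entropy

# Example: a CoDOTS-style discretization with v = 10 yields errors roughly
# uniform on (0, 10), so eps_rec comes out close to 10.
rng = np.random.default_rng(0)
S = rng.uniform(0, 100, size=10000)
R = np.floor(S / 10) * 10
print(reconstruction_error(S, R, bins=10))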

Our definition of εrec[f] is mainly inspired by the notion of privacy as defined in [1]. However, here we focus on the error after the reconstruction and not before. Our εrec[f] represents the uncertainty inherent in the (probably unknown, but approximately known) true probability function of the error pξ.

Another important issue in security analysis is the coverage of the data reconstruction. Intuitively, a reconstruction procedure that manages to reconstruct the values in all original dimensions is more threatening than one that reconstructs only half as many dimensions. Therefore, we introduce the coverage of a reconstruction attack, given a transformation f(·), as follows:

Definition 4 Let d denote the dimension of the original data and m the dimension of the reconstructed data, for a given transformation f(·) and a process g(·) in $\mathbb{R}^m$ that tries to compute its inverse. The coverage of f(·) is:

$$cover[f] = \frac{m}{d} \qquad (4)$$

Disclosure of sensitive data from non-sensitive data is known as the inference problem. It was initially investigated in the area of statistical databases and more recently in data mining. It challenges security in any data mining setting because it allows attackers to infer (potentially sensitive) information with a certain level of confidence from publicly available data. This holds in particular in open distributed data mining environments, where everybody can act as a potential attacker. In such environments, the exchanged intermediary information, such as models, may reveal sufficient information to allow for the reconstruction of local sensitive data.

Another important threat is the possibility of collusion. A collusion group is a partition of the mining group whose members (the attackers) cooperate to disclose the data of the peers in the other partition of the group (the victims). Collusion is difficult to detect and to avoid [14]. Furthermore, collusion may increase the accuracy of an inference attack, since the malicious peers may collaborate by exchanging information to reveal sensitive data of the victims.

Now let us extend our definitions to include the scenarios where a collusion may take place.

Definition 5 Let R be a reconstructed data set produced

through a collusion of k members of the mining group. We

denote by εrec[f ](k) ∈ R+ the reconstruction accuracy

of a given f(·) with k colluders. Similarly, we denote by

cover[f ](k) ∈ R+ the coverage of reconstruction of a given

transformation f(·) with k colluders.

Definition 6 (Confidentiality) We define the confidentiality measure of a given transformation f, denoted by conf[f] : N → R+, as:

conf[f](k) = ε_rec[f](k) / cover[f](k)    (5)

Hence, the lower the coverage of the reconstruction

given a transformation, the more confidentiality f(·) pre-

serves.


Definition 7 (Inference Risk Level) Given a transforma-

tion f and k the number of colluders in a mining group, we

define the risk of inference inherent to f as the exponential

reciprocal of the confidentiality:

IRL[f](k) = 2^{−conf[f](k)}    (6)

With this definition, if IRL[f](k) ≈ 0 we say that an algorithm based on the transformation f(·) is secure against inference even in the presence of a collusion of k peers. On the other hand, if IRL[f](k) ≈ 1, we say that it is insecure against inference given collusions of k peers.
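As a hedged illustration of Definitions 4–7, the following sketch combines a reconstruction error value with coverage to obtain conf[f](k) and IRL[f](k); the function names and the numbers in the example are illustrative, not taken from the paper.

def coverage(reconstructed_dims, original_dims):
    """Definition 4: cover[f] = m / d."""
    return reconstructed_dims / original_dims

def confidentiality(eps_rec, cover):
    """Definition 6: conf[f](k) = eps_rec[f](k) / cover[f](k)."""
    return eps_rec / cover

def inference_risk_level(conf):
    """Definition 7: IRL[f](k) = 2 ** (-conf[f](k))."""
    return 2.0 ** (-conf)

# Hypothetical attack that reconstructs w = 5 of d = 20 dimensions with
# residual uncertainty eps_rec = 4.
cov = coverage(5, 20)                      # 0.25
conf = confidentiality(4.0, cov)           # 16.0
print(inference_risk_level(conf))          # ~1.5e-05, i.e. close to 0 (secure)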

6.2 Analysis of CoDOTS

We focus the confidentiality analysis of the CoDOTS al-

gorithm on the transformation used in step 3 (cf. algorithm

1), since only transformed data is exchanged among the

peers.

Consider an attack scenario where each attacker works alone (collusion groups are singletons, k = 1). Now let f denote the transformation procedure transform() in CoDOTS. There are two possible attacks: a local peer attack and a helper attack. The helper attack appears to be the more serious one, since the helper receives information from all local peers. Therefore we focus on it.

We assume that the transformation function f and its parameters w and v are not known to the helper. However, the helper can try to reconstruct a given x′ from the counting sets it receives by extracting the set of transformed points O′_j from C′_j. Without further information the helper needs to estimate the discretization amount v used to transform the data. To this end, the helper can compute the differences between elements x′_a and x′_b of an arbitrary point x′ ∈ O′_j, for an arbitrary peer j, and take

v̂ = min{ |x′_a − x′_b| : ∀a, ∀b ∈ (1, w), a ≠ b }    (7)

Note that ∀a, ∀b ∈ (1, w), |x′_a − x′_b| ≥ v by definition of our transformation, i.e. the true v is a lower bound for the estimate v̂.

The helper may assume that an arbitrary element x′_i of an arbitrary data point was originally in the interval (x_i, x_i + v̂), with uniform density 1/v̂. Without any further information the malicious helper can pick one value from this interval, but the uncertainty of this choice is given by the entropy of a random variable uniformly distributed on the interval (0, v̂), which in this case (uniform distribution) is log₂ v̂. We use this result to claim that ε_rec[f](1) = v̂ by the definition of ε_rec in the previous section. Note that our transformation does not prevent v̂ from taking the true value of v, although in general v is only a lower bound for v̂.

Since the helper does not know the original dimension d of the data points, it can only reconstruct points with dimension w, the dimension of the transformed points. The coverage of the attack in this scenario is then cover[f](1) = w/d. Therefore,

conf[f](1) = v̂d/w ≥ vd/w    (8)

and IRL[f](1) = 2^{−(v̂d/w)} ≤ 2^{−(vd/w)}.

Now let us consider the second scenario, where k > 1.

In this case, we assume that k colluders know the trans-

formation function f(·), its parameters w and v, and the

original dimension d of data points. Without the helper,

the peers receive only a subset of their own counting table.

Therefore, we assume that the helper is also a member of

the collusion. Using the same procedure the helper now

can reconstruct the points using the true value of v. Fur-

ther, using the w-dimensional points, the colluders may use

some heuristic (e.g. seeing each multidimensional point as

a series and using splines to fill the missing values) to get

reconstructed points with size d. The coverage in this case

is cover[f ](k) = 1. Consequently,

conf[f](k) = v    (9)

and IRL[f](k) = 2^{−v}.

In summary, the confidentiality of CoDOTS is directly

related to the amount of discretization v used in the trans-

formation. That is, the transformation used in CoDOTS in-

troduces an amount of uncertainty at every single point of

the time series but preserves a sufficient amount of infor-

mation that allows collaborative mining agents to still per-

form a coarse-grained outlier detection by use of the trans-

formed series. Hence, the inference of exact knowledge about local or global outliers from the transformed time series is avoided to a degree that depends on the chosen value of the parameter v, with an inference risk bounded within (0, 1). Example applications and scenarios in which only exact knowledge of time series outliers is of real value to attackers include predicting changes in the energy consumption behavior of clients while keeping the exact values and implied costs private, detecting anomalies in stock value evolution, and any other business asset that relies on exact knowledge of the time series outliers considered.

7 Conclusions

In this paper we presented CoDOTS as a solution to the

problem of privacy preserving outlier detection in time se-

ries data. We also introduced a confidentiality framework to

formalize the notion of privacy in a distributed data mining

environment. The framework assumes that the mining group is formed by potentially malicious parties, which may form collusion groups. Future development involves substituting the transformation function (step 3 in algorithm 1) with other dimensionality reduction or discretization techniques and analyzing their impact on the privacy level. Another possible extension is to modify the algorithm to cope with distributed


stream mining. As far as the framework is concerned, we

have plans to apply it to other distributed data mining algo-

rithms, such as distributed data classification, looking for a

general confidentiality framework for distributed data min-

ing.

Acknowledgments

The authors are very grateful to the UCR Time Series

Data Mining Archive [9] for providing us with the time se-

ries data. This work has been supported in part by the Ger-

man Ministry of Education and Research (BMBF 01-IW-

D02-SCALLOPS) and the Brazilian Ministry for Education

(CAPES 0791/024).

References

[1] Dakshi Agrawal and Charu C. Aggarwal. On the de-

sign and quantification of privacy preserving data min-

ing algorithms. In Proceedings of 20th ACM Sympo-

sium on Principles of Database Systems, pages 247–

255, Santa Barbara, California, May 2001.

[2] Rakesh Agrawal and Ramakrishnan Srikant. Privacy-

preserving data mining. In Proc. of the ACM SIGMOD

Conference on Management of Data, pages 439–450.

ACM Press, May 2000.

[3] M. Atallah, E. Bertino, A. Elmagarmid, M. Ibrahim,

and V. Verykios. Disclosure limitation of sensitive

rules. In Proceedings of 1999 IEEE Knowledge and

Data Engineering Exchange Workshop (KDEX’99),

pages 45–52, Chicago, IL, November 1999.

[4] Mikhail Atallah, Marina Bykova, Jiangtao Li, Keith

Frikken, and Mercan Topkara. Private collaborative

forecasting and benchmarking. In Proceedings of the

2004 ACM workshop on Privacy in the electronic so-

ciety, pages 103–114, New York, NY, USA, October 2004.

ACM, ACM Press. CERIAS TR 2004-50.

[5] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke.

Privacy preserving mining of association rules. In Pro-

ceedings of 8th ACM SIGKDD Intl. Conf. on Knowl-

edge Discovery and Data Mining (KDD), Edmonton,

Alberta, Canada, 2002.

[6] Pedro Galeano, Daniel Peña, and Ruey S. Tsay. Out-

lier detection in multivariate time series via pro-

jection pursuit. Statistics and Econometrics Work-

ing Papers ws044211, Universidad Carlos III, De-

partamento de Estadística y Econometría, September

2004. Available at http://ideas.repec.org/p/cte/wsrepe/ws044211.html.

[7] H. V. Jagadish, Nick Koudas, and S. Muthukrishnan.

Mining deviants in a time series database. In Mal-

colm P. Atkinson, Maria E. Orlowska, Patrick Val-

duriez, Stanley B. Zdonik, and Michael L. Brodie,

editors, VLDB’99, Proceedings of 25th International

Conference on Very Large Data Bases, September 7-

10, 1999, Edinburgh, Scotland, UK, pages 102–113.

Morgan Kaufmann, 1999.

[8] Hillol Kargupta, Souptik Datta, Qi Wang, and Kr-

ishnamoorthy Sivakumar. On the privacy preserving

properties of random data perturbation techniques. In

ICDM ’03: Proceedings of the Third IEEE Interna-

tional Conference on Data Mining, page 99, Washing-

ton, DC, USA, 2003. IEEE Computer Society.

[9] E. Keogh and T. Folias. The UCR Time Series Data Mining Archive, 2002. http://www.cs.ucr.edu/~eamonn/TSDMA/index.html.

[10] E. Keogh, S. Lonardi, and B. Chiu. Finding sur-

prising patterns in a time series database in linear

time and space. In Proceedings of the International

Conference on Knowledge Discovery and Data Min-

ing (KDD’02), pages 550–556, Edmonton, Alberta,

Canada, July 2002.

[11] Srujana Merugu and Joydeep Ghosh. Privacy-

preserving distributed clustering using generative

models. In Proceedings of the 3rd IEEE International

Conference on Data Mining (ICDM 2003), 19-22 De-

cember 2003, Melbourne, Florida, USA. IEEE Com-

puter Society, 2003.

[12] Themistoklis Palpanas, Dimitris Papadopoulos, Vana

Kalogeraki, and Dimitrios Gunopulos. Distributed de-

viation detection in sensor networks. SIGMOD Rec.,

32(4):77–82, 2003.

[13] Benny Pinkas. Cryptographic techniques for privacy-

preserving data mining. ACM SIGKDD Explorations

Newsletter, 4(2):12–19, 2002.

[14] Sarvapali D. Ramchurn, Dong Huynh, and

Nicholas R. Jennings. Trust in multi-agent sys-

tems. Knowledge Engineering Review, 2004.

[15] Yucel Saygin, Vassilios S. Verykios, and Ahmed K.

Elmagarmid. Privacy preserving association rule min-

ing. In Research Issues in Data Engineering (RIDE),

2002.

[16] Jaideep Vaidya and Chris Clifton. Secure set inte-

section cardinality with application to association rule

mining, March 2003. Submitted to ACM Transactions

on Information and Systems Security.


Grouping Multivariate Time Series: A Case Study

Tamraparni Dasu, AT&T Labs - Research, Email: [email protected]
Deborah F. Swayne, AT&T Labs - Research, Email: [email protected]
David Poole, AT&T Labs - Research, Email: [email protected]

Abstract— We present a case study to demonstrate a process for grouping massive multivariate time series based on nonparametric statistical summaries aided by information visualization. We want a method that allows us to quickly find approximate groups in time series, both to identify typical aggregate behaviors and to find aberrant outliers.

We use simple statistical summaries to capture the temporal nature and variability of the time series, as well as the interaction between the various multivariate components. Each individual time series is mapped to a fixed-length vector of summaries. The summary vectors are then clustered using any fast clustering algorithm like k-means. Appropriate information visualization techniques are used at every stage to guide the analyst. Because the method is nonparametric, it is customizable and flexible, and it generalizes easily. When choosing the statistical summaries, we can incorporate domain knowledge that may enhance the clustering. We demonstrate with a massive real life telecommunications application.

I. INTRODUCTION

Clustering time series is an important problem that has attracted considerable interest under the larger umbrella of temporal data mining. Practical applications abound both in industry and in scientific fields, and include identifying customers with similar growth patterns, grouping network elements that exhibit similar failure histories, and isolating batches of patients with similar disease progression.

Time series can be large in three ways, posing challenges to analysis: the length of the time series, the number of samples or data points, and the number of attributes recorded at each time step. Time series can be arbitrarily long depending on the granularity of measurement and the historical availability of the data. In the telecommunications industry, time series of interest may span many hundreds of time points; for example, even a week's worth of network data measured at hourly intervals has 168 time periods. The number of samples or data points can represent hundreds of millions of residential customers or network elements. Furthermore, many interesting time series are multivariate, adding to the challenges of scale. Customers buy different products and pay by different means in each time period; network elements have multiple characteristics, such as the traffic in either direction, packet loss and types of applications; patients may be subjected to several tests to measure vital signs or symptoms at each time point.

Much of the literature in temporal data mining deals with single component (univariate) time series. Univariate time series are typically mapped to fixed length vectors of parameters using models such as Fourier coefficients, wavelet coefficients or ARMA (Auto-Regressive Moving Average) models ([16], [2]), after which the vectors of parameters are clustered. These methods, however, are not appropriate for massive multivariate time series. Fourier and wavelet methods are either undefined or very expensive to compute in more than two to three dimensions, and methods like ARMA require strong linearity assumptions which are restrictive. Nonlinear models for parametrizing multivariate time series exist, but their computation typically requires multiple passes, rendering them infeasible for massive data.

In this paper, while we address the problem of grouping time series, our emphasis is slightly different from that of conventional time series clustering. Our data sets typically consist of millions of observations with numerous attributes measured at each time point. In the past, we have been effective at dealing with massive high dimensional data by partitioning ([6]) our data sets into more manageable chunks, either for screening purposes (removing data glitches) or for further analysis using more rigorous methods. In this same spirit of partition-based analysis, our purpose in this paper is not so much optimal separation as an approximate grouping based on multivariate partitions. We want a fast, computationally light method for rapidly teasing out representative groups and outliers. It should be flexible enough to be used by people with various skills and to be customized based on knowledge of the domain and the data. In the tradition of exploratory data analysis, visualization plays a critical role in our methodology.

In typical temporal data mining literature, visualization is used to display the individual time series or subsequences, and to illustrate the similarities among members of a cluster. Visualization plays a wider but univariate role in [8], a general purpose exploration and visual mining technique for temporal data. Our goal here is very specific. We want to group massive multivariate time series and our visualization is aimed at learning distributional behavior, statistical properties and interdependence of attributes. In addition to guiding the clustering process, our visualization helps us interpret, characterize and understand the groups we discover. Pictures that characterize the clusters help us communicate the results easily to consumers of this information, an important goal in itself.

The rest of the paper is organized as follows. In Section II, we describe the methodology. In Section III, we introduce our case study based on telecommunications data. Section IV describes the results and interpretation. In Section V, we discuss our future work, primarily cluster migration. In Section VI we


summarize the problem, methodology and the findings.

II. METHODOLOGY

Let

X = {X_ijt}, i = 1, …, n; j = 1, …, d; t = 1, …, T

represent a multivariate time series, where i stands for the sample index, j stands for the attribute index and t indexes time. We define the following process.

A. Preprocessing

We inspect the data, looking for recording errors, damaged or mangled records, and data anomalies, such as records that are all zero or all missing. We might have domain knowledge that tells us about eccentric conventions like "Do not populate the Revenue field if the customer spends less than a dollar," leading to missing values for that attribute. Such conventions are common in legacy systems and are seldom documented.

Obvious groups are sometimes separated at this stage to sharpen the clustering process. In our application, there was a big chunk of records that had all zeroes or was sparsely populated by ones. We chose to filter them out manually since we want to distinguish such records from those that are populated with low values. We will explain this further in Section III.

We may also compute appropriate transformations of some of the initial attributes.

B. Nonparametric Summaries

During this stage we compute marginal summaries and longitudinal summaries, and select the variables we will use for clustering.

Marginal summaries at time t are of the type M_jt, where the computations are performed across the records, holding the attribute j and time t fixed. The name marginal summaries derives from the fact that the computations utilize P_t(X_j), the marginal distribution of the component X_j at time t. Examples of M_jt include component-wise mean, standard deviation, and quantiles. Such summaries can be computed rapidly in a single scan of the data, even order-based summaries like quantiles [9].

Visualizing the marginal summaries enables us to perform an informal variable selection, removing attributes that are highly correlated and those with very little variability. Correlated attributes will adversely influence clustering and attributes with hardly any variability will have no effect on the clustering. Similarly, attributes that are relevant to or populated for a very small percentage of the records have little or no influence on the clustering algorithm and should be screened out manually.

Longitudinal summaries are computed across time within a record, akin to the "signatures" discussed in [3]. Such summaries capture temporal behavior and include summaries across time such as slope, quantiles of first order differences of the time series, number of changes in direction, and quantiles of the original time series itself. Longitudinal summaries are of the form L_ij(X), where the computations are done across time, holding the indices i and j fixed. We map an individual time series to a fixed length vector containing these nonparametric longitudinal summaries. Note that we do not make any model assumptions (linear trend, normal error). Since our summaries are so general, they are widely applicable and can be customized to any given application.
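As an illustration of these two kinds of summaries, the sketch below computes a few marginal summaries M_jt and maps each record to a fixed-length vector of longitudinal summaries; the specific summary choices, names and toy data are illustrative assumptions, not the paper's exact list.

import numpy as np

# X has shape (n_records, n_attributes, n_time_steps), i.e. X[i, j, t].
def marginal_summaries(X):
    """M_jt-style summaries: computed across records for fixed (j, t)."""
    return {"mean": X.mean(axis=0),
            "std": X.std(axis=0),
            "q25": np.quantile(X, 0.25, axis=0),
            "q75": np.quantile(X, 0.75, axis=0)}

def longitudinal_summaries(X):
    """L_ij-style summaries: computed across time for each record and
    attribute, yielding one fixed-length vector per record."""
    diffs = np.diff(X, axis=2)                                # first order differences
    feats = [X.mean(axis=2), X.std(axis=2),
             np.quantile(X, 0.5, axis=2),                     # median level
             np.quantile(diffs, 0.5, axis=2),                 # typical change
             (np.diff(np.sign(diffs), axis=2) != 0).sum(axis=2),  # direction changes
             X[:, :, -1] - X[:, :, 0]]                        # overall change
    return np.concatenate([f.reshape(X.shape[0], -1) for f in feats], axis=1)

rng = np.random.default_rng(1)
X = rng.gamma(shape=2.0, size=(1000, 4, 36))                  # toy data: 36 weeks
summary_vectors = longitudinal_summaries(X)
print(summary_vectors.shape)                                  # (1000, 24)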

When standard statistical summaries don't seem to be appropriate, an alternative multivariate summary to consider at this point is a multivariate frequency table where we bin the entire time series into a fixed length vector of frequencies using a suitable multivariate binning scheme. We can augment this with temporal information in longitudinal summaries such as those discussed in Section III. Recent literature [2] has shown that coarse binning is often quite effective.

C. Clustering

After we compute the summaries, we cluster the data using a fast, convenient algorithm such as k-means [4] and evaluate the results. (See [5] for an overview of algorithms for clustering massive data.) At every step, we use graphics to help us validate our choices and to guide the next step in the analysis.

1) Assessment of Clustering: An informal visual technique to find an appropriate number of clusters is to plot a measure of goodness-of-fit (R-square, Akaike's Information Criterion (AIC) or Cubic Clustering Criterion (CCC) [12]) against the number of clusters and then read off a suitable choice. To the best of our knowledge, there is no authoritative, widely accepted solution to choosing the number of clusters and assessing goodness-of-fit of the clusters.
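A minimal sketch of this informal technique, assuming the summary vectors from the previous snippet and scikit-learn's KMeans; R-square is computed here as one minus the ratio of within-cluster to total sum of squares.

import numpy as np
from sklearn.cluster import KMeans

def r_square_by_k(vectors, k_values):
    """Proportion of total variability explained by each clustering,
    plotted (or printed) against k to read off a suitable choice."""
    total_ss = ((vectors - vectors.mean(axis=0)) ** 2).sum()
    scores = {}
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
        scores[k] = 1.0 - km.inertia_ / total_ss   # R-square = 1 - WSS/TSS
    return scores

# Example usage with the toy summary vectors computed earlier:
# for k, r2 in r_square_by_k(summary_vectors, [2, 5, 10, 20]).items():
#     print(k, round(r2, 3))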

We assess the "layout" of the clusters by visualizing the relative sizes of the clusters and the distances between them. This is a convenient way to isolate outlying clusters visually and collapse any redundant clusters that may have been artificially created to accommodate the number of clusters specified to the algorithm. This technique can be used by a non-expert competently at an intuitive level.

2) Attributes: We characterize the clusters in terms of their attributes, with the goal of making statements such as this: "Cluster 1 consists of customers that spend most of their time on e-mail and cluster 2 consists of customers that spend most of their time browsing websites." We perform visual comparison and interpretation of the clusters using "pin plots," parallel coordinate plots, boxplots and "bubble plots". Taken together, these plots allow us to learn what is distinctive about each cluster, which clusters are similar to each other, which attributes have the greatest impact, and which clusters have the greatest variance.

3) Longitudinal Profiles: The temporal nature of the clusters is key to their characterization. "Growth clusters" might contain desirable customers that are ramping up their network usage whereas "Steady clusters" might contain customers that have optimized their network needs. An interesting predictive task is to isolate clusters that represent transitory stages such as "Potential decline or growth" and use them to anticipate and preempt customer attrition.


Plotting individual time series is feasible for small data sets, but not appropriate for massive data sets such as ours. Instead, we plot time series of representative aggregates of individual clusters such as percentiles of dominant attributes. This method ensures that we can visually follow aggregate trends without the clutter of the individual time series, and enables us to interpret clusters created using highly transformed and derived variables in the space of the raw data.

4) Outliers: We identify clusters that are small or far away from others and investigate them in the same manner. Bubble plots of clusters in two-dimensional projections allow us to quickly identify clusters that are far afield. Typically these tend to be small, consisting of a handful of data points. It is worth visualizing them individually to single out profitable outliers (rapidly expanding customer) and problematic data.

5) Visualization: We prefer plots that are easily interpretable, because they help us communicate about the data. The plots we use are, except where otherwise noted, created in the R statistical software [10].

Most of the visualization methods we describe depend only on the summary values that capture each cluster, so they are not affected by the scale of the data.

III. A REAL LIFE APPLICATION

We demonstrate our methodology by applying it to a real telecommunications usage data set observed during 2004. While we describe attributes in general terms for proprietary reasons, the application is genuine with real implications for business.

A. Data Set Description

We observed the pattern of usage of a telecommunications service by 1,673,696 users over 36 consecutive weeks. During a given time period, a user may use the service multiple times. At each point in time we measured:

Frequency-of-use: the number of times the service was used during that week,
Extent-of-use: the total time the service was in use during the week,
Intensity-of-use: the proportion of times the service was used "Briefly", "Moderately", "Heavily" and "Intensely", where each of the categories has a hard definition based on numeric thresholds,
Time-of-use: the proportion of times the service was used during pre-defined "Peak", "Evening" and "Late Night" periods of a 24-hour day.

Each of these 1.67 million users used the service at least two times in the 36 time periods. Others who used it exactly once were excluded because of data quality concerns related to the way single uses in sparse usage patterns are processed. This decision was validated by examining the total usage of the excluded observations outside of the study period, which proved to be insignificant.

B. Preprocessing and Derived Variables

Computing derived variables is an important stage where the data can be customized to the application. For example, we can incorporate domain knowledge (e.g., weight month n appropriately because the service was not available until the middle of that month) or transform the variables to make them more suitable for the analysis without affecting the results. This gives both control and flexibility to the user.

For our application, we started with simple summaries and histograms of the attributes. Some attributes were highly skewed with long right tails in the distribution. We applied an appropriate log transform to reduce the skewness. We computed both pairwise and partial correlations and eliminated correlated attributes. (The threshold we chose was 0.7.) In addition, we computed summaries to capture longitudinal or temporal variations.

Our list of derived variables is chosen in the spirit of our nonparametric approach. Our summaries (means, differences, quantiles) are generally applicable in any context and are easy to compute. We also explored a host of others such as coefficients of skewness and kurtosis but in the end we chose those that were representative, carefully avoiding duplication. We also avoided model-based parameters like Fourier coefficients or ARMA regression coefficients.

The above approach enables us to map arbitrarily long time series to a fixed-length vector of summaries that can then be fed to the clustering algorithm. This is a well-known technique for clustering time series, for instance see [16]. However, by relying on nonparametric summaries instead of model parameters, we gain several advantages. First, we can customize the summaries to the desired granularity and specificity, ensuring better results. Second, since the summaries are more robust than the raw time series, they are a better input to the clustering algorithm. Third, we can capture the attribute dependence through simple summaries like pairwise correlations and partial correlation coefficients or even multivariate histograms where feasible.

Some of the derived variables were:

Initial values — Number of times the service was used and the total duration for which it was used in the initial time period,
Final values — Number of times the service was used and the total duration for which it was used in the final time period,
The range of usage, namely R = Max usage − Min usage, over the 36 periods of time,
Average duration for which the service was used during the 36 time periods,
The average proportion (over the 36 time periods) of times the service was used in each of the intensity-of-use categories, e.g., "Moderately",
The average proportion of times the service was used during each of the "time-of-day" buckets, e.g., "Peak".


Fig. 1. Time series of the total weekly frequency of use by 1.67 million users (x-axis: Weeks; y-axis: Frequency of use, millions). The two inverted spikes and the final point correspond to weeks containing major holidays. Within the series there are periods of slight decline and gradual increase.

The number of zeroes in the time series, namely the number of time periods in which the user had no usage,
The number of times the use of the service increased in consecutive time periods, as well as the number of times it decreased,
The overall percentage change in the usage from the first period to the final period,
Typical first order differences of (1) the number of times a service was used and (2) the total duration for which it was used during a time period.

After we chose the derived attributes, we further weeded out the ones that were correlated so that the clustering algorithm could perform accurately. We ended up with the following final set of attributes:

Total frequency-of-use across the 36 weeks (t_freq),
Average extent-of-use for each use of the service (hold),
Mean proportion of times it was used "Moderately" (mp_mod),
Mean proportion of times it was used during "Peak" periods (mp_peak),
Number of times the frequency-of-use increased (s_chng_p),
The percentage change in frequency-of-use from initial to final time period (slope), and
The typical first order changes in the frequency-of-use (m_freq_diff) and extent-of-use (m_mts_diff).

For our convenience, we chose the k-means clustering offered by SAS in PROC FASTCLUS [13], but any clustering mechanism that can cluster large data sets can be used. Note that k-means standardizes the data (i.e. centers the data at the mean and scales by the standard deviation) to keep attributes with high variability from dominating the analysis.
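The paper uses SAS PROC FASTCLUS; as a hedged stand-in, the sketch below shows the same two steps (standardize the summary vectors, then run k-means) with scikit-learn, using the final attribute names as column labels and toy data in place of the real usage records.

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

attributes = ["t_freq", "hold", "mp_mod", "mp_peak",
              "s_chng_p", "slope", "m_freq_diff", "m_mts_diff"]

# One row per user, one column per derived attribute (toy data here).
rng = np.random.default_rng(2)
summaries = pd.DataFrame(rng.gamma(2.0, size=(10_000, len(attributes))),
                         columns=attributes)

# Standardize so high-variability attributes do not dominate, then cluster.
z = StandardScaler().fit_transform(summaries)
km = KMeans(n_clusters=20, n_init=10, random_state=0).fit(z)

summaries["cluster"] = km.labels_
print(summaries["cluster"].value_counts().head())   # relative cluster sizes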

IV. THE RESULTS

The clustering exercise resulted in the following coarse grouping – two very large clusters (#6 and #3), a few medium

Fig. 2. Time series of total weekly frequency of use for four of the clusters: Cluster 6 (519363), Cluster 5 (1303), Cluster 18 (42) and Cluster 19 (15); x-axis: Weeks, y-axis: Frequency of use (millions). The cluster sizes are shown in parentheses. Cluster 6 is a large cluster (519K) with low initial usage yet rapid growth in the final third of the series. In contrast, cluster 5 is a smaller cluster where usage is clearly in decline. Clusters 18 and 19 are tiny clusters whose usage growth rate is phenomenal, particularly in cluster 19 where the growth started after the first 12 weeks.

sized clusters (18, 19, 15 and 5) and several tiny ones. The biggest cluster (Cluster 3) accounts for the bulk of the data and represents the typical steady customers that have reached their potential and are relatively big users. The other large cluster, Cluster 6, represents new customers that are small users but are in a steep growth phase. The other clusters are differentiated on attributes such as intensity of use and rate of change in usage. The smallest clusters are outliers resulting from abnormal circumstances such as natural disasters or truly distinctive customers (for example, having identical usage every day).

We start with an initial look at the aggregate plots of interesting attributes to observe general trends. Due to space constraints we show only a subset.

The overall trend in usage of the service is shown in Figure 1. It is essentially flat over the 36 weeks, although there are shorter periods of gradual decrease and increase within the series. The inverted spikes occurred in weeks that contained major public holidays, and are not surprising. Figure 2 shows how some of the individual clusters behave; there is considerable variability in the data.

A. Number of Clusters

We used an informal goodness-of-fit metric to choose the number of clusters. The measure, R-square, represents the proportion of variability in the data explained by the clustering. The metric starts off at 0.1 for 1 cluster, and improves rapidly until around 10 clusters and then starts flattening out. At 20 clusters, it is around 0.52. At 150 clusters, it is about 0.72.

Bubble plots are a representation of the layout of the clusters in two-dimensional space. In Figure 3, at 15 clusters there is a nice separation. At 20 clusters, there is now a pair of smaller clusters in the bottom left. Cluster 15, in particular, seems redundant, but the leftmost plot in Figure 4 shows that it is well-separated in another dimension. Further clustering


Fig. 3. Bubble Plots for 15, 20 and 30 Clusters. In these two-dimensional scatterplots of cluster means, the size of a "bubble" is roughly proportional to the size of the cluster. Here we compare the results of k-means clustering as we vary the number of clusters in plots of s_chng_p vs t_freq. For 15 clusters there is a nice separation of cluster means, with two very large clusters and a sprinkling of smaller ones. For 20 clusters, there is some additional separation of the large clusters. Further clustering seems to yield no substantially different clusters.

Fig. 4. Bubble plots (hold vs t_freq and slope vs t_freq). These plots help us characterize and compare clusters. In the leftmost plot of hold vs t_freq, we see that Cluster 15 is most clearly distinguished by its unusually long holding times. We also note that Cluster 5 has high holding times. In the rightmost plot, slope vs t_freq, we note in particular that Clusters 18 and 19 both have unusually high slopes, indicating increasing usage.


(30 clusters) seems to yield no substantially different clusters. The rest of the discussion in this section is thus based on our choice of 20 clusters.

The pairwise plots in Figure 4 provide further evidence that each split (clusters 18 and 19; clusters 3 and 6) is indeed warranted by a genuine separation in space and is not merely a partitioning of relatively evenly distributed attributes.

B. Cluster Characteristics

Once we settle on a number of clusters, we examine the individual cluster characteristics for validation and interpretation. As seen in the bubble plots, there are two huge clusters, several smaller ones, and numerous small outlier clusters. We examine some of these here.

A good way to characterize clusters is to observe the relationship between selected attributes across clusters. Bubble plots are particularly effective at this, as has been discussed earlier. Figure 4 shows plots of hold (holding time) and slope (rate of change in usage) against total usage over the 36 week period. From the picture it is clear that the big clusters differ based on their total usage and the rate of change in the usage. The outliers too stand out in their high usage or high growth profiles. Some of the smaller clusters that seem close together in this particular projection are separated in other dimensions such as mp_peak and mp_mod.

Figure 5 shows the two different ways to look at cluster profiles. In the top row of pin plots, the X-axis denotes the clustering attributes, and the Y-axis represents the standardized value of the center of the cluster: that is, the average values of each of the attributes for that cluster, indicated by the height of the corresponding pins. Cluster 5 is distinguished by a high value of hold, the holding time; cluster 18 has an unusually high slope, indicating that it is a growth cluster.

The plot at the bottom of the figure is a parallel coordinates plot [15] of the cluster means. In this multivariate plot, attribute axes are parallel rather than orthogonal, and each record (cluster) is described by line segments connecting its value on each parallel axis. Clusters 3 and 6 are highlighted: note how different their values are on mp_peak, the proportion of use in peak hours. This picture is a screen shot of a window in the ggobi software [14].

C. Longitudinal Profiles of Clusters

A common approach to visualize clusters is to plot the raw data or time series for each individual cluster. This is fine for small clusters, and ideal for striking outliers like the singleton shown in Figure 7. However, for clusters as large as some of ours, the resulting plots are indecipherable. Instead, we plot representative summaries such as percentiles at each time point.

Figure 6 shows the longitudinal profiles of total usage for the two largest clusters. We have plotted the mean, the median, and selected percentiles, using an appropriate variable transform to make the plots legible while maintaining the relative relationship. The mean is represented by the dark solid line while the percentiles are represented by the broken

Fig. 7. Time series of the weekly frequency of use for an outlying cluster of size 1 (x-axis: Weeks; y-axis: Frequency of use, millions). The two large spikes towards the end of the series, especially the final point, are a direct result of two of the Florida hurricanes in 2004. This time series was so different from any other that it warranted a cluster all by itself.

lines. Notice that the mean on each plot is very high, indicating a highly skewed distribution. The lower quantiles are almost coincident, indicating a big mass at the lower values of the distribution of the attribute. We can see the two "V"-shaped indentations we mentioned earlier corresponding to holidays. The cluster in the leftmost plot represents the largest customers, characterized by high and gently declining usage. The cluster in the rightmost plot describes the second biggest cluster. These are new users that have started using the service during our 36 periods of observation. Since they are new users, the growth of their usage is sometimes slow and gradual, accounting for a big concentration at lower usage values early in the time series. However, once they start using the service, usage may increase rapidly.

Longitudinal summaries based on percentiles are a convenient way of viewing the time series behavior of the cluster. Since the percentiles have an ordering, they never cross each other, making the plots easy to interpret. The mean, of course, is not constrained in this fashion, as can be seen in the longitudinal profile of cluster 6.
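A minimal sketch of these longitudinal profiles, assuming per-user weekly frequencies and the cluster labels from the earlier snippet; the percentile choices mirror the ones named in Figure 6, and the data here is a toy stand-in.

import numpy as np

def cluster_profile(weekly_usage, labels, cluster_id,
                    percentiles=(10, 50, 75, 90, 95)):
    """Per-week mean and percentiles of usage for one cluster, ready to plot
    on a square-root scale as in Figure 6. weekly_usage has shape
    (n_users, n_weeks); labels assigns each user to a cluster."""
    members = weekly_usage[labels == cluster_id]
    profile = {"mean": members.mean(axis=0)}
    for p in percentiles:
        profile[f"p{p}"] = np.percentile(members, p, axis=0)
    return profile

# Example with toy data: 10,000 users, 36 weeks, cluster labels 0..19.
rng = np.random.default_rng(3)
usage = rng.gamma(1.5, size=(10_000, 36))
labels = rng.integers(0, 20, size=10_000)
prof = cluster_profile(usage, labels, cluster_id=3)
print(np.sqrt(prof["p95"])[:5])   # square-root transform used for plotting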

D. Outliers

Figure 7 shows the behavior of an outlying cluster of size 1. This time series was so extreme and different from anything else that it occupied a cluster all by itself. Further investigation revealed that the two spikes in usage were generated as a consequence of the hurricanes in Florida in 2004. Other smaller clusters exhibited such startlingly dissimilar behavior, each precipitated by abnormal circumstances of some kind.

V. FURTHER RESEARCH

The bulk of our further research is focussed on cluster stability and migration. In characterizing time series clusters, we may want to learn whether the clusters we have identified are persistent over time with the same qualitative interpretation. If these clusters are stable, do data points migrate from


Fig. 5. Cluster profiling for Cluster 5 (1303 TNs) and Cluster 18 (42 TNs), over the attributes t_freq, slope, hold, mp_mod, mp_peak, s_chng_p, m_freq_diff and m_mts_diff. The two "pin plots" at the top show profiles of two clusters. We put a large number of plots on a page, one for each cluster, and we compare the shape of one plot to another. Cluster 5 is most clearly distinguished from the other clusters based on hold, while Cluster 18 is most clearly distinguished by slope. The parallel coordinates plot essentially draws all the pin plots on top of each other, adding line segments between the heads of the pins. The traces for the two largest clusters, 3 and 6, have been highlighted, and several points have been labeled. Because of overplotting, parallel coordinates plots are best used in a direct manipulation setting, with brushing and labeling; this picture is a screen shot from the ggobi software.

cluster to cluster? Do the clusters represent different stages in some typical trajectory that a user takes over a long period of observation? In a typical trajectory, a user starts off as "New" and proceeds to a "Growth" cluster. Depending on the usage, the user might migrate to a steady state or a high use outlier group, or go into a gentle decline and eventually fade out. Some users might start off as high usage outliers and stay that way. From time to time, we might observe evanescent clusters that arise from abnormal circumstances experienced by the user, the service or the environment (for example, natural disasters).

If a handful of canonical states and trajectories exist and are stable over time, we can empirically quantify the likelihood of making a transition from one cluster to another. A recent paper [7] addresses a similar issue in the context of spatial data. There are potentially many approaches to solving the general problem of cluster migration, ranging from Markov processes [11] to counting processes and survival models [1].

We build on the methodology developed in this paper — fast, transparent and flexible methods based on nonparametric statistical summaries; user validation and interpretation; information visualization at each step of the way.

Our experience with the current data set indicates that canonical clusters exist, are persistent over time, and are stable with a consistent interpretation. Our outliers are typically evanescent clusters that show up when abnormal events happen. When the effect of the event fades out they fall back into one of the permanent clusters. Furthermore, the canonical clusters do represent certain stages in the natural evolution of the user of the service, and we do see meaningful migration


Fig. 6. Longitudinal profiles of the two largest clusters, Cluster 3 (1152838) and Cluster 6 (519363); x-axis: Week, y-axis: square root of the mean and percentiles of t_freq; curves shown include the 95th, 90th, 75th and 10th percentiles, the median and the mean. The leftmost plot shows the profile of the biggest cluster, Cluster 3, with respect to total frequency of usage. We have plotted the square root of the mean, the median and selected percentiles; the mean is represented by the dark solid line. Notice that the mean is between the th and th percentiles, while the lower quantiles are almost coincident, all of which indicates a very skewed distribution. Cluster 3 represents steady usage over the time period with a gentle decline. The rightmost plot shows the longitudinal profile of Cluster 6, the second biggest cluster. We see an even more skewed distribution, with the mean rising above the th percentile by about Week 30. We also see increasing usage: these are new users that have started using the service during our observation period.

between clusters.

VI. CONCLUSION

In this paper, we have developed a methodology to guide users through the task of clustering massive, multivariate time series data. Our goal is to rapidly group the data into a useful set of representative clusters and outliers. We use information visualization to understand, interpret and communicate the results. Our methodology is widely applicable, flexible, fast and can easily handle millions of records. It can be customized to any application or skill level.

Using our method, we have grouped 1.67 million multivariate time series into representative clusters and outliers. We characterized and interpreted the nature of the clusters and outliers using elegant information visualization methods ranging from bubble plots to parallel coordinate plots. Best of all, our method is accessible to the wide range of domain experts who use this data and this methodology.

At first glance, this methodology might seem ad hoc, but we claim that we have found a balance between automation and reliance on human judgement and expertise that is ideal for our problem. We need to find useful, interpretable clusters in massive data, and finding clusters that are optimally separated takes a lot longer and provides us with no real advantage. Furthermore, the data sets we analyze are typically heterogeneous, complex and evolving, due to changes in business paradigms or data processing conventions: we rarely encounter exactly the same data set twice. We need a strong exploratory flavor to discover these undocumented changes.

REFERENCES

[1] P. Andersen, O. Borgan, R. Gill, and N. Keiding. Statistical Models Based on Counting Processes. Springer, New York, 1992.
[2] A. J. Bagnall and G. Janacek. Clustering time series from ARMA models with clipped data. In ACM SIGKDD, pages 49–58, 2004.
[3] C. Cortes and D. Pregibon. Signature-based methods for data streams. Data Mining and Knowledge Discovery, 5:167–182, 2001.
[4] J. A. Hartigan. Clustering Algorithms. John Wiley & Sons, Inc., 1975.
[5] A. Hinneburg and D. A. Keim. Clustering techniques for large data sets: From the past to the future. In ACM SIGKDD, pages 141–181, 1999.
[6] T. Johnson and T. Dasu. Comparing massive high-dimensional data sets. In ACM SIGKDD, pages 229–233, 1998.
[7] C. Lai and N. T. Yeung. Predicting density-based spatial clusters over time. In IEEE Conference on Data Mining, pages 443–446, 2004.
[8] J. Lin, E. Keogh, S. Lonardi, J. P. Lankford, and D. M. Nystrom. Visually mining and monitoring massive time series. In ACM SIGKDD, pages 460–469, 2004.
[9] G. S. Manku, S. Rajagopalan, and B. G. Lindsay. Approximate medians and other quantiles in one pass and with limited memory. In ACM SIGMOD, pages 426–435, 1998.
[10] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-00-3.
[11] L. R. Rabiner and B. H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, pages 4–15, January 1986.
[12] W. S. Sarle. Cubic clustering criterion. SAS Technical Report A-108, Cary, NC: SAS Institute Inc., 1983.
[13] SAS Institute Inc. SAS/STAT Software: User's Guide, Volume 1, Version 6.11. 1990.
[14] D. F. Swayne, D. Temple Lang, A. Buja, and D. Cook. GGobi: evolving from XGobi into an extensible framework for interactive data visualization. Computational Statistics & Data Analysis, 43:423–444, 2003.
[15] E. J. Wegman. Hyperdimensional data analysis using parallel coordinates. Journal of the American Statistical Association, 85:664–675, 1990.
[16] Y. Xiong and D.-Y. Yeung. Mixtures of ARMA models for model-based time series clustering. In IEEE Conference on Data Mining, pages 717–720, 2002.


Topographical Proximity: Exploiting Domain Knowledge for

Sequential Data Mining

Ann Devitt

Ericsson

Dublin 4

Ireland

[email protected]

Joseph Duffin

Ericsson

Dublin 4

Ireland

[email protected]

Abstract

In today's mobile telecommunications networks, increasingly powerful fault management systems are required to ensure robustness and quality of service of the network. In this context, fault alarm correlation is of prime importance to extract meaningful information from the vast quantities of alarms generated by the network. Existing sequential data mining techniques address the task of identifying possible correlations in frequent sequences of telecoms alarms. These frequent sequence sets, however, may contain sequences which are not plausible from the point of view of network topology constraints. This paper presents the Topographical Proximity (TP) approach which exploits the topographical information encoded in telecommunication alarms in order to address this lack of plausibility in mined alarm sequences. An evaluation of the quality of mined sequences is presented and discussed. Results show an improvement in overall system performance when imposing proximity constraints.

1 Introduction

Given the growing complexity of mobile telecommunications networks, the task of ensuring robustness and maintaining quality of service in the network requires increasingly powerful network management systems. Furthermore, the steady increase in size and complexity of the network produces a corresponding increase in the volume of data generated by network elements (e.g. alarms, performance indicators), placing added strain on management systems. In particular, the area of fault management remains a key problem area for network operators, as the speed at which faults are handled has very immediate consequences for network performance. The complex, inter-connected nature of the network means that a single fault may produce a cascade of alarms from affected network elements. Conversely, intermittent, self-clearing alarms may be raised without any attendant fault in the network. In this context, event correlation provides a means of dealing with the large volume of alarm data. Correlations define relations between alarm events that facilitate the processes of alarm filtering, masking and prioritising specified in ITU-T recommendations [8]. While sequential data-mining techniques have evolved to identify possible useful correlations in alarm data, the task of identifying the subset of important and plausible correlations remains heavily dependent on the domain expertise of network equipment manufacturers and operators. Yet alarms encode substantial domain knowledge, in particular topographical information regarding the network elements which generated a given alarm. Furthermore, telecommunications networks, although complex, conform to a well-defined topology of network elements.

This paper addresses the challenge of harnessing the latent domain knowledge available in alarm data in order to provide criteria for automatically evaluating the plausibility of mined alarm correlations. Section 2 sets out current approaches in the domain of sequential data-mining addressing the task of event correlation. Section 3 describes the need to exploit topographical attributes of the input data to validate mined sequences and how this has been realised for telecommunications alarm data as the Topographical Proximity (TP) measure. Section 4 describes a set of experiments aimed at providing a qualitative evaluation of the topographical proximity approach for mining telecommunications alarm data. The results are presented and discussed in section 5.


2 Sequential Data Mining

Telecommunications alarm data is inherently temporal and sequential in nature, consisting of a series of timestamped events. The specific problem of identifying relationships between events in a sequential dataset can be viewed as a subset of the problem of mining for associations between dataset elements in general, constrained by the temporal aspects of the data. The domain of sequential data mining addresses this problem space with the objective of finding noteworthy sequences of events or sequential patterns that suggest relationships between constituent events. In theory, the notion of noteworthiness may be task-specific. In practice, however, a sequence which is noteworthy often equates to a sequence which occurs frequently in the input data. However, frequency as the sole measure of sequence "noteworthiness" is not a valid measure for network alarm data, where frequency may indicate redundancy. The research presented here is motivated by the need to establish novel criteria for pattern selection in sequential data mining.

Much of the foundation work in sequential mining techniques shares a common historical origin in the Apriori association rule mining algorithm for transaction data [2]. Apriori is based on the assumption that a frequent sequence of elements must consist of elements which are themselves frequent. The algorithm generates a set of frequent sequences by iterating through a "generate and count" process, generating candidate sequences of increasing length and pruning the set based on sequence frequency or support (i.e. normalised frequency) values. Candidates are generated by a process of merging two existing sequences of length n − 1 to give a sequence of length n, as in example 1.

ABC + ABD => ABCD (1)
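To make this merge step concrete, here is a minimal sketch of Apriori-style candidate generation over event sequences; it is a generic illustration, not the candidate generator of any specific algorithm cited in this section.

def merge_candidates(frequent):
    """Apriori-style join: two frequent sequences of length n-1 that share
    the same prefix are merged into a candidate of length n, as in
    example (1): ABC + ABD => ABCD."""
    candidates = set()
    for a in frequent:
        for b in frequent:
            if a != b and a[:-1] == b[:-1]:
                candidates.add(a + b[-1:])
    return candidates

print(sorted(merge_candidates({("A", "B", "C"), ("A", "B", "D")})))
# [('A', 'B', 'C', 'D'), ('A', 'B', 'D', 'C')]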

The WINEPI [9] and GSP [11] algorithms were among the first to adapt the Apriori technique to mine for temporal association rules in sequential data. Both employ a sliding time window with a user-specified duration to traverse the input data, extracting sequences according to user-specified minimum and maximum sequence duration constraints. Although the basic premise for the two algorithms is the same, they differ in many design and implementation details. The GSP algorithm was designed for mining transaction data and, therefore, incorporates extra transaction-based constraints on viable candidate sequences. Furthermore, GSP events or items may be organised in a taxonomy, allowing events or their superordinates in the taxonomy to be used for calculating support values or generating candidate sequences. WINEPI, on the other hand, is optimised for flat sequential data, like telecoms alarm data, and addresses the issue of full or partial ordering of event sequences.
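A minimal sketch of the window-based extraction these algorithms share, assuming timestamped alarm events; window handling details differ between WINEPI and GSP, so this is only a generic illustration, and the alarm type names are made up.

def windowed_subsequences(events, window):
    """Slide a time window of fixed duration over timestamped events and
    emit the ordered event types falling inside each window position.
    events: list of (timestamp, event_type) pairs, sorted by timestamp."""
    sequences = []
    for i, (t_start, _) in enumerate(events):
        seq = tuple(etype for t, etype in events[i:] if t < t_start + window)
        if len(seq) > 1:
            sequences.append(seq)
    return sequences

alarms = [(0, "LINK_DOWN"), (2, "SYNC_LOSS"), (3, "CELL_UNAVAILABLE"),
          (40, "LINK_DOWN"), (41, "SYNC_LOSS")]
print(windowed_subsequences(alarms, window=10))
# [('LINK_DOWN', 'SYNC_LOSS', 'CELL_UNAVAILABLE'),
#  ('SYNC_LOSS', 'CELL_UNAVAILABLE'), ('LINK_DOWN', 'SYNC_LOSS')]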

Other Apriori-based approaches aim to optimise performance within the same conceptual framework. MINEPI [9] is an extension of the WINEPI algorithm which optimises space and time constraints by compressing event sequences to their minimal occurrence window. FreeSpan [6] focuses on the candidate generation process, employing a database of projected sequence extensions to ensure that the system only generates candidates that exist in the data. Its extensions, PrefixSpan [10] and IncSpan [3], modify the projected database structure and access to optimise the depth-first search of possible candidate sequences. SPADE [15] decomposes the search space and uses lattice-based search strategies to optimise performance. Other algorithms impose further constraints on the Apriori mining approach: the constraint-based extension of the SPADE algorithm, cSPADE [14], imposes various syntactic constraints on mined sequences, while SPIRIT [5] mines for sequences which match user-specified regular expression constraints.

Apriori-based approaches assume that the aim is to identify highly frequent patterns. Other approaches are designed to extract sequences according to different criteria. Weiss [13] describes a supervised machine learning system using genetic algorithms where the objective is to predict rare, rather than frequent, equipment failure events on the basis of alarm sequences; candidate sequences are generated by a combination and/or mutation process. Heierman et al [7] use periodicity and length of sequences as well as frequency in their candidate selection process. Sterritt [12] presents a hybrid approach which combines genetic algorithms and Bayesian belief networks to derive structures based on sequences with a strong cause and effect relationship. The research set out below is based on an Apriori approach but introduces a novel criterion for sequence selection which evaluates sequence plausibility and coherence in terms of network topology.

3 Topographical Proximity

The algorithms outlined in section 2 are capable of efficiently extracting thousands of event sequences from sequential input data. Therefore, post-processing remains an essential component of a usable mining system, whereby sequences which are deemed to be uninteresting because they are redundant or simply implausible are eliminated from the output. The Topographical Proximity (TP) approach introduced in this paper constitutes a means of determining the plausibility of a correlation between events in mined sequences at runtime of the mining process. The algorithm quantifies how closely alarm-generating elements are connected to each other in terms of the logical structure of a network, using topographical information extracted from the alarms themselves. The general assumption is that the more closely connected the alarm-generating elements, the more plausible and hence interesting the relationship between the alarms, and the greater the likelihood that there is some cause and effect relationship between them. At runtime, a measure of Topographical Proximity is used to reject or promote candidate sequences on the basis of their connectedness. Not only does this ensure that the output sequence set is plausible within the context of the network, but the space and time constraints of the data mining process are optimised, as the algorithm uses both frequency and proximity to reduce the dimensions of the candidate sequence set, thereby restricting the search space of possible correlations. The measure may also be used during post-processing to rank sequences in terms of the connectedness of their constituent alarm events. Section 3.1 outlines how the TP measure is calculated based on a generic network topology. Section 3.2 describes how the measure has been integrated into the sequential mining process.

3.1 TP Calculation algorithm

The TP algorithm calculates the logical distance between alarm-generating network elements. The value has a minimum of zero for nodes that have no logical connection in the network and a maximum of one for nodes that have a very clear and close connection. TP calculation is based on the Radio Access Network of a standard UMTS telecommunication network, which consists of functional nodes connected by communication interfaces and arranged in a logical, hierarchical structure, represented by the simplified schema in figure 1 (see footnote 1). Each node in this system has functional subcomponents which may generate fault alarms, which are then communicated to a designated Radio Access Network Management Node via a standard interface. Node subcomponents represent a node's internal functionality, the functionality of the interfaces between nodes or logical communications artefacts. The position of an alarm-generating node in the hierarchical structure is encoded in its full distinguished name, included in the source node attribute of each alarm.

Footnote 1: The TP calculation algorithm, however, is valid for any network which consists of functional nodes connected by interfaces and arranged in a logical structure. In [4], we describe how proximity values may be predefined as constants and assigned on the basis of shared and disjunct topographical information of alarm-generating nodes.

Figure 1. Simplified telecommunications network schema: a Network root connected to MasterNode1 and MasterNode2; each MasterNode has two ParentNodes (e.g. ParentNode1_1, ParentNode1_2), and each ParentNode has Child nodes (Child1-Child6 under MasterNode1, Child1A-Child6A under MasterNode2).

In the context of this hierarchical network, the topographical proximity value for network elements on the same branch of the network is automatically assigned the maximum value of 1, to reflect the direct descendancy relation between the network elements, for example between Child1 and ParentNode1_1 or ParentNode2_2 and MasterNode2 in figure 1. For network elements that are not on the same branch of the network (e.g. Child5A and Child1 in figure 1), the topographical proximity value equates to a weighted traversal of the network branches or edges between the two elements; the TP value reflects the number of edges that must be traversed to find a path between them. The weighting reflects the assumption that some hierarchical relations are closer than others: for the purposes of this analysis, Child nodes form tighter clusters around Parent nodes than Parent nodes do around Master nodes. This reflects the assumption that alarms on elements lower in the hierarchy may be more likely to share a common cause. Thus, nodes Child1 and Child2 in figure 1 are deemed closer in the context of the network than nodes ParentNode1_1 and ParentNode1_2.

For any two alarms, the source node attribute of each alarm is parsed to give the inheritance hierarchy of the network element with which that alarm is associated. The Topographical Proximity value for the two associated network elements is then calculated according to algorithm 1. Examples 2 to 4 below provide some sample TP values based on the network elements in figure 1.

Algorithm 1 TP calculation algorithm

Input: 2 network elements, E1 and E2
Output: TP value, 0 ≤ TP ≤ 1

TP = 0
if sameBranch(E1, E2) then
    Return 1
end if
if sharedParentNode(E1, E2) then
    TP += 0.4
end if
if sharedMasterNode(E1, E2) then
    TP += 0.35
end if
if sharedNetwork(E1, E2) then
    TP += 0.05
end if
Return TP

TPcalculation(Child1, Child3) = 0.8 (2)

TPcalculation(Child1, Child6) = 0.4 (3)

TPcalculation(Child1, Child1A) = 0.05 (4)
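A minimal Python sketch of Algorithm 1 is given below. The path-tuple encoding of a node's distinguished name and the helper functions are assumptions made for illustration, not the authors' implementation. Note that the conditions are cumulative, so two nodes under the same ParentNode score 0.4 + 0.35 + 0.05 = 0.8, matching example (2).

```python
def same_branch(e1, e2):
    """True if one element is an ancestor of the other (same branch of the hierarchy)."""
    shorter, longer = sorted((e1, e2), key=len)
    return longer[:len(shorter)] == shorter

def shared_prefix(e1, e2, depth):
    """True if both elements agree on the hierarchy down to the given depth."""
    return len(e1) > depth and len(e2) > depth and e1[:depth + 1] == e2[:depth + 1]

def tp_calculation(e1, e2):
    """Algorithm 1: Topographical Proximity of two alarm-generating elements, 0 <= TP <= 1.
    Elements are paths parsed from the alarm's source node attribute, e.g.
    ("Network", "MasterNode1", "ParentNode1_1", "Child1")."""
    if same_branch(e1, e2):
        return 1.0
    tp = 0.0
    if shared_prefix(e1, e2, 2):   # same ParentNode
        tp += 0.4
    if shared_prefix(e1, e2, 1):   # same MasterNode
        tp += 0.35
    if shared_prefix(e1, e2, 0):   # same Network
        tp += 0.05
    return round(tp, 2)

child1  = ("Network", "MasterNode1", "ParentNode1_1", "Child1")
child3  = ("Network", "MasterNode1", "ParentNode1_1", "Child3")
child6  = ("Network", "MasterNode1", "ParentNode1_2", "Child6")
child1a = ("Network", "MasterNode2", "ParentNode2_1", "Child1A")
print(tp_calculation(child1, child3))   # 0.8  -- example (2)
print(tp_calculation(child1, child6))   # 0.4  -- example (3)
print(tp_calculation(child1, child1a))  # 0.05 -- example (4)
```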

3.2 Integration of TP to the mining algorithm

The current implementation integrates the Topographical Proximity approach with the candidate generation component of the MINEPI algorithm [9]. MINEPI generates candidate sequences of length n by combining two existing sequences of length n − 1 and stores the minimal, or most compact, occurrences of all frequent sequences for subsequent iterations. Our algorithm filters all occurrences of candidate sequences on the basis of their connectedness within the network, as represented by the TP value calculated for the alarm-generating network elements. This filtering can be implemented in one of two ways:

1. Store minimal occurrences of all sequences above a given TP threshold;

2. Store the occurrences with the highest TP value of all sequences.

In the first case, the space constraints of the system are optimised for sequence compactness, in the second for sequence connectedness. In order to compare the performance of the original Minepi algorithm with that of the Topographical Proximity approach, the experiment reported in section 4 takes the first approach, using the TP value to prune the candidate set rather than to explicitly optimise sequence storage. The final step, as with Minepi, prunes the remaining candidate set based on a support (i.e. frequency) threshold.

Each minimal occurrence of a sequence has an associated proximity value. For sequences of length two, the TP value is calculated according to algorithm 1. For longer sequences, the TP value is the mean of the TP values for the two existing occurrences to be merged and the proximity value calculated for the source nodes of the first and last alarms of the new candidate, as in algorithm 2. For example, candidate sequence 7 below is composed of subsequences 5 and 6 (see footnote 2).

Algorithm 2 calculateSequenceTP

Input: seq, alarm1, alarm2 . . . alarmn
Output: TPvalue

if length(seq) == 2 then
    return calculateTP(alarm1, alarm2)
else
    TPseq1 = Retrieve from memory TPalarm1...(n−1)
    TPseq2 = Retrieve from memory TPalarm2...n
    TPnew = calculateTP(alarm1, alarmn)
    return (TPseq1 + TPseq2 + TPnew) / 3
end if

Seq 1 = Child1, Child3, MasterNode1 (5)

Seq 2 = Child3, MasterNode1, Child1A (6)

Seq = Child1, Child3, MasterNode1, Child1A (7)

TPSeq, the TP value for the new candidate sequence Seq, is calculated as follows, where the only new TP calculation evaluates the connection between Child1 and Child1A:

TPSeq = (TPSeq1 + TPSeq2 + TPcalc(Child1, Child1A)) / 3 (8)
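Continuing the earlier sketch, the following is a minimal Python rendering of Algorithm 2; the memoisation dictionary keyed by source-node tuples is an assumed data structure, not the paper's storage scheme.

```python
def calculate_sequence_tp(source_nodes, tp_memo):
    """Algorithm 2 sketch: TP value of a candidate sequence of alarm source nodes.
    tp_memo caches the TP values of the two length n-1 subsequences being merged."""
    if len(source_nodes) == 2:
        return tp_calculation(source_nodes[0], source_nodes[1])
    tp_seq1 = tp_memo[tuple(source_nodes[:-1])]                  # alarm_1 ... alarm_{n-1}
    tp_seq2 = tp_memo[tuple(source_nodes[1:])]                   # alarm_2 ... alarm_n
    tp_new = tp_calculation(source_nodes[0], source_nodes[-1])   # the only new calculation
    return (tp_seq1 + tp_seq2 + tp_new) / 3
```

For the candidate in sequence (7), only TPcalc(Child1, Child1A) is computed afresh; the two subsequence values are retrieved from memory, as in equation (8).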

The added cost of the TP computation is minimal, as for each occurrence of a new candidate sequence only one new TP calculation is carried out. Furthermore, the cost is offset by the reduction in the search space of candidate sequences at each iteration achieved by imposing a minimum TP value threshold. Unlike a support threshold, the TP threshold is not an arbitrary means of reducing the set of candidate sequences. The TP threshold can be set to reflect domain experts' intuitions regarding what connections constitute plausible sequences in their network. A support threshold is imposed after the TP threshold, but the frequency constraint can be more flexible given that the candidate sequence set is pre-pruned for proximity. Section 5 explores how the use of the topographical proximity threshold interacts with the standard mining parameters of maximum sequence duration and minimum support value to obtain optimum results in a qualitative evaluation of mined sequences.

Footnote 2: For the purposes of illustrating the TP calculation, the alarms in the sample sequences are represented by their source nodes. The examples refer to the simplified network in figure 1.

4 Experiments

A set of experiments was conducted in order to provide a qualitative evaluation of the mining algorithm at different topographical proximity thresholds. To date, research has tended to focus on system performance, justifiably given the intensive computation involved in the mining process. What has been notably lacking, however, is an evaluation of the quality of the mined sequences. The experiment described below aims to address this shortfall. To this end, the mining task has been formulated as one of identifying specific target sequences in the data. The experiment was run on a Pentium 4 3.2 GHz processor with 2 GB of RAM running Microsoft Windows XP Professional version 2002.

4.1 Test Cases

For the purposes of this experiment, the time window and minimum support system parameters were tested within the ranges of 60-600 seconds at 60 second intervals and 25-175 occurrences at intervals of 25, respectively. This gives a total of 70 test cases (10 time windows × 7 support values) for each Topographical Proximity (TP) threshold value. For each time window and support parameter combination, baseline system performance of Minepi without Topographical Proximity (TP = 0) was calculated. Six further test cases for each parameter combination were evaluated at TP = 0.5, 0.6, 0.7, 0.8, 0.9, 1. The aim was to determine optimum system parameters and TP threshold values from the 490 (10 × 7 × 7) test cases and to establish whether the imposition of a TP threshold improved the quality of the output sequence set.
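The 490 test cases are simply the cross product of the three parameter ranges; a small sketch of how such a grid might be enumerated (the run_mining function is a hypothetical placeholder, not part of the published system):

```python
import itertools

time_windows = range(60, 601, 60)                    # 60-600 s in 60 s steps
support_values = range(25, 176, 25)                  # 25-175 occurrences in steps of 25
tp_thresholds = [0.0, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]  # 0.0 is the Minepi baseline

test_cases = list(itertools.product(time_windows, support_values, tp_thresholds))
print(len(test_cases))  # 490 = 10 * 7 * 7

# for window, support, tp in test_cases:
#     results[(window, support, tp)] = run_mining(alarms, window, support, tp)
```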

4.2 Methodology

Most commercially available alarm management systems are fully dependent on the expertise and experience of network analysts to derive rules for filtering and correlating alarms. This experiment aims to provide a global measure of the quality of the performance of the mining algorithm, evaluated in the context of the domain knowledge of such experts. This objective has been formulated as the task of identifying, in live network data, common alarm sequences specified by network analysts.

Dataset. The basic dataset for the experiments consists of 96,991 alarms from the Radio Access Network (RAN) of a live telecommunications network. The alarm format conforms to telecoms standards [1] and includes a timestamp with a granularity of milliseconds and thirteen attributes relating to four broad categories of alarm timing, event lifecycle, alarm type and alarm source details.

Target Sequences Set. The quality of the output frequent event sequences must be evaluated relative to the frequency of known event sequences in the input data. In order to compile a target set of event sequences, a detailed statistical analysis of the alarm data was conducted by network experts. The analysis focused on the most frequently occurring individual alarms in order to identify repeating alarms and suspected correlations among the frequent alarm set. The result was a target set of twenty event sequences consisting of eighteen repeating alarm sequences, nine of length two events and nine of length four, and two inter-event correlations of length two and four. This set of twenty sequences represents a baseline of gold standard sequences which experts extrapolate from the dataset and which the algorithm should identify in the dataset.

Procedure. The mining algorithm was run on the dataset of 96,991 alarms for the 490 test cases set out in section 4.1. For each test case, three performance metrics were calculated based on the number of target sequences from the set of twenty target sequences identified for these parameters and threshold values.

4.3 Performance Metrics

The metrics used to determine performance in the experiment reported below are the measures of precision and recall borrowed from the Information Retrieval domain. In the context of this mining experiment, the measures are defined as follows:

• Precision: the number of correctly identified target sequences relative to the total number of sequences found by the system.

Precision = (Number of target sequences found) / (Total number of sequences found)

• Recall: the number of correctly identified target sequences relative to the total number of target sequences.

Recall = (Number of target sequences found) / (Number of sequences in the target set)


A high precision value indicates that the algorithm is selective and does not identify many spurious sequences. A high recall value indicates that the algorithm is accurate, successfully identifying most of the target sequences. These two metrics are combined to give a single indicator of system performance, the F Score, representing the trade-off between these two indicators of precision and accuracy. A high F Score value indicates that the algorithm is both selective and accurate with respect to the target sequence set. The F Score is calculated according to the following formula:

• F Score = (2 × Precision × Recall) / (Precision + Recall)
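For reference, the three metrics can be computed as below. This is a straightforward sketch; representing the mined and target sequence sets as Python sets of tuples is an assumption about the data structures, not part of the experimental setup.

```python
def evaluate(found, target):
    """Precision, recall and F score of a mined sequence set against the target set."""
    hits = len(found & target)                        # correctly identified target sequences
    precision = hits / len(found) if found else 0.0
    recall = hits / len(target) if target else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall > 0 else 0.0)
    return precision, recall, f_score
```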

The performance metrics were calculated for perfect matches of target sequences identified by the system. They focus on the performance of the mining algorithm in terms of its ability to identify patterns known to exist in the data while restricting these patterns to ones which represent plausible connections in a telecommunications network. Results are presented and discussed in section 5.

5 Results

In order to isolate the impact of the Topographical Proximity value on system performance in this experiment, the effects of the time window and support parameters were analysed. The ten graphs in figure 2 illustrate performance for each of the ten time windows from 60 to 600. Each graph plots the three performance metrics of precision, recall and F Score for all TP value thresholds: the Minepi baseline (TP = 0) and TP = 0.5, 0.6, 0.7, 0.8, 0.9, 1. The figure illustrates that there is little variation in performance across the ten time windows. This would strongly suggest that window size is not a significant factor in the task of identifying target sequences. This can be attributed to the fact that the sequences are short in duration and therefore should be identified at all window sizes above 60 seconds. Figure 2 presents the results derived using a support threshold of 100, but results at all support thresholds exhibit the same characteristics.

The minimum support parameter, however, has a much greater effect on sequence identification, as illustrated in figure 3. The seven subplots demonstrate system performance for support values 25, 50, 75, 100, 125, 150, 175 at a time window of 240 seconds. The plots show quite different behaviour for the seven support thresholds. It is therefore in the context of these seven experimental conditions, represented by the seven support thresholds, that we evaluate the effect of the TP value on system performance.

Figure 3 demonstrates a clear trend across all support value thresholds: as the TP threshold increases, there is a decrease in recall with a corresponding increase in precision, giving an overall increase in F Score value. This trend reflects the trade-off between reproducing the target sequence set in the output and generating a more restricted and, therefore, precise set of output sequences. The trade-off is such that, despite the reduction in recall values, overall performance, represented by the F Score value, improves as higher TP thresholds are enforced. This result validates expectations that restricting the sequence selection process to only accept topographically plausible sequences will significantly reduce the number of spurious sequences identified, thereby reducing the search space at runtime and facilitating post-processing. Furthermore, the reduction in recall values, particularly for TP ≥ 0.7, may be addressed by employing the second strategy outlined in § 3.2 of optimising sequence storage with reference to proximity rather than sequence duration.

The results reported here would suggest that the use of the topographical proximity value yields a favourable trade-off between accuracy and recall for sequential data mining of telecoms alarm data. Furthermore, we would suggest that the output sequence set for higher TP thresholds more accurately represents the opinion of domain experts that:

• interesting correlations occur on related or connected nodes;

• frequency alone may not be an appropriate criterion for identifying noteworthy sequences in telecommunications data.

6 Future Work

The research reported in this paper suggests two complementary directions for future work. The first is to extend the topographical proximity measure to a broader sequence validation methodology. This can be addressed by identifying those attributes of individual alarms which are significant, not for classifying individual alarms into types, but at the sequence level for validating alarm sequences. Future implementations aim to exploit attributes such as alarm severity and probable cause to generate a more refined measure of sequence plausibility by which to constrain the sequence generation process. Furthermore, the algorithm described in this paper assumes a simplified and homogeneous network topology. This is an over-simplification which needs to be addressed in future development by exploiting other explicit connections within the telecoms network.


Figure 2. Performance metrics (Precision, Recall, F Score) by TP threshold, 60 ≤ timeWindow ≤ 600, support = 100; one panel per time window (MaxTime = 60 to 600).

The second key extension to the current research regards the qualitative evaluation of sequential mining algorithms. The experiment described above infers a target sequence set from domain experts' analysis of the input data. Devitt et al. [4] describe an experiment which uses a silver standard of alarm data with synthetic sequences inserted in known quantities and distributions into the data. A further step would require the development of a gold standard dataset for telecoms alarm data where all significant and interesting correlations have been tagged in the data by domain experts.

7 Conclusions

The main contribution of this paper is to introduce the Topographical Proximity (TP) approach for sequential mining of telecommunications alarm data. This measure exploits the topographical information encoded in alarms to validate all candidate sequences at run-time with respect to the plausibility of the possible correlation they represent. The second significant contribution is to provide a qualitative evaluation of the performance of the mining algorithm. The evaluation results strongly suggest that the performance of the mining algorithm improves with the inclusion of the TP measure.

References

[1] 3GPP. 3rd Generation Partnership Project; Technical Specification Group Services and System Aspects; Telecommunication management; Fault Management; Part 2: Alarm Integration Reference Point (IRP): Information Service (IS) (Release 6). 3GPP TS 32.111-2 V6.3.0, 3GPP, 2004.

[2] R. Agrawal, T. Imielinski, and A. N. Swami. Mining associations between sets of items in massive databases. In Proceedings of the ACM-SIGMOD 1993 International Conference on Management of Data, pages 207-216, Washington, D.C., May 1993.

Figure 3. Performance metrics (Precision, Recall, F Score) by TP threshold, 25 ≤ support ≤ 175, timeWindow = 240; one panel per support value (Freq = 25 to 175).

[3] H. Cheng, X. Yan, and J. Han. IncSpan: Incremental mining of sequential patterns in large databases. In Proceedings of KDD 2004, 2004.

[4] A. Devitt, J. Duffin, and R. Moloney. Topographical proximity for mining network alarm data. In Proc. of MineNet'05, SIGCOMM 2005, pages 179-184, 2005.

[5] M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential pattern mining with regular expression constraints. In Proc. of VLDB'99, Scotland, 1999.

[6] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent patterns without candidate generation: A frequent-pattern tree approach. Data Mining and Knowledge Discovery, 8(1):53-87, 2004.

[7] E. O. Heierman, G. M. Youngblood, and D. J. Cook. Mining temporal sequences to discover interesting patterns. In Proceedings of KDD 2004, Workshop on Mining Temporal and Sequential Data, 2004.

[8] ITU. ITU-T Recommendations: M.3030 Principles for a telecommunication management network, 1988.

[9] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1:259-289, 1997.

[10] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. Mining sequential patterns by pattern-growth: The PrefixSpan approach. IEEE Transactions on Knowledge and Data Engineering, 16(10), 2004.

[11] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the 5th International Conference on Extending Database Technology, EDBT, 1996.

[12] R. Sterritt. Discovering rules for fault management. In Proc. of ECBS'01, pages 190-196, Apr. 2001.

[13] G. M. Weiss. Predicting telecommunication equipment failures from sequences of network alarms. In W. Kloesgen and J. Zytkow, editors, Handbook of Knowledge Discovery and Data Mining. Oxford University Press, 2002.

[14] M. J. Zaki. Sequence mining in categorical domains: Incorporating constraints. In Proc. of CIKM'00, pages 422-429, 2000.

[15] M. J. Zaki. SPADE: An efficient algorithm for mining frequent sequences. Machine Learning, 42(1-2):31-60, 2001.


Mining Spatio-Temporal Association Rules, Sources, Sinks, Stationary Regions and Thoroughfares in Object Mobility Databases

Florian Verhein and Sanjay Chawla
School of Information Technologies
University of Sydney
NSW, Australia
fverhein, [email protected]

Abstract

As mobile devices proliferate and networks become more location-aware, the corresponding growth in spatio-temporal data will demand analysis techniques to mine patterns that take into account the semantics of such data. Association Rule Mining (ARM) has been one of the more extensively studied data mining techniques, but it considers discrete transactional data (supermarket or sequential). Most attempts to apply this technique to spatio-temporal domains map the data to transactions, thus losing the spatio-temporal characteristics. We provide a comprehensive definition of spatio-temporal association rules (STARs) that describe how objects move between regions over time. We define support in the spatio-temporal domain to effectively deal with the semantics of such data. We also introduce other patterns that are useful for mobility data: stationary regions and high traffic regions. The latter consist of sources, sinks and thoroughfares. These patterns describe important temporal characteristics of regions and we show that they can be considered as special STARs. We provide efficient algorithms to find these patterns. Particularly, via several pruning properties we can mine STARs efficiently by first mining high traffic regions.

1 Introduction

As mobile devices proliferate, networks become location-aware and GPS sensors become more common, the need to manage, analyze and mine spatio-temporal data will only grow.

Some specific examples of current and future applications which will require mining and analysis of spatio-temporal data include managing cell phone networks and dealing with the data generated by Radio-Frequency Identification (RFID) tags. Mining such data could detect patterns for applications as diverse as intelligent traffic management, sensor networks, stock control and wildlife monitoring. For example, consider the movement of users between cells of a mobile phone (or similar) network. Being able to predict where users will go would make cell handover decisions easier and improve bandwidth management. Also, since most people own a mobile phone these days, the data could be used for fast and inexpensive population movement studies. Local governments would find the ability to answer questions such as "how much is this park being used?", "which areas are congested?", and "what are the main routes that people take through the city" useful. The latter query would help design better pedestrian and vehicle routes to take into account the main flows of people.

We therefore consider a set of regions, which may be any shape or size, and a set of objects moving throughout these regions. We assume that it is possible to determine which objects are in a region, but we do not know precisely where an object is in that region. We do not assume that objects are always somewhere in the region set, so in the example of a mobile phone network, turning the phone off poses no problems for our methods. We are interested in finding regions with useful temporal characteristics (thoroughfares, sinks, sources, and stationary regions) and rules that predict how objects will move through the regions (spatio-temporal association rules). A source occurs when the number of objects leaving a region is high enough. A sink has a high number of objects entering it. A thoroughfare is a region through which many objects move - that is, many objects enter and leave. A stationary region is where many objects tend to stay over time, while a STAR describes how an object moves between regions. Together, these patterns describe many mobility characteristics and can be used to predict future movements.

We take the approach of mining our patterns on a time window by time window basis. We think this is important because it allows us to see the changing nature of the patterns over time, and allows for interactive mining, including changing the mining parameters. Even though the patterns we consider occur in a spatial setting, they are all temporal patterns because they describe objects' movements over time, as well as capturing changes in the way the objects move over time. To understand this, consider each pattern set ξi as capturing object movements over a 'short' period of time. In our algorithms this is the interval pair [TIi, TIi+1]. That is, ξi captures how the objects move between the time intervals TIi and TIi+1. Then, as the algorithm processes subsequent time intervals, the patterns mined at these points will in general change, forming a sequence of pattern sets < ξi, ξi+1, ... >. This change in the patterns that are output can be considered longer term change. Such changes in the patterns describe the changes in the objects' behaviour over time. Another way to think about this is to consider the objects' motion as a random process. If the process is stationary, we would expect the patterns to remain the same over time. If the process is not stationary, the patterns will change with time to reflect the change in the way the objects move.

There are a number of challenges when mining spatio-temporal data. First, dealing with the interaction of space and time is complicated by the fact that they have different semantics. We cannot just treat time as another spatial dimension, or vice versa. For example, time has a natural ordering while space does not. Secondly, we also need to deal with the spatio-temporal semantics effectively. This includes considering the effects of area and the time-interval width not only on the patterns we mine, but also in the algorithms that find those patterns.

Finally, handling updates efficiently in a dynamic environment is challenging, especially when the algorithm must be applied in real time. We adopt a data stream model where spatial data arrives continuously, say as a sequence of snapshots S1, S2, ..., and the model that we mine must keep up with this. The algorithms must therefore perform only a single pass in the temporal dimension. That is, the algorithm must not revisit Si once it has started processing Si+1; this means that the model must be incrementally updateable. Unless sampling techniques are used, such algorithms cannot do better than scale linearly with time. Since processing the spatial snapshots is expensive in general, we focus our attention there. We deal with exact techniques in this paper, but it is possible to use probabilistic counting techniques together with our algorithms.

2 Contributions

We make the following contributions:

• We give a rigorous definition of Spatio-Temporal Association Rules (STARs) that preserve spatial and temporal semantics. We define the concepts of spatial coverage, spatial support, temporal coverage and temporal support. Because these definitions retain the semantics of spatial and temporal dimensions, they allow us to mine data with regions of any size without skewing the results. That is, we successfully extend association rules to the spatio-temporal domain.

• We define useful spatio-temporal regions that apply to objects moving through such regions. These are stationary regions and high traffic regions. The latter may be further broken into sources, sinks and thoroughfares. We stress that these are temporal properties of a spatial region set, and show that they are special types of STARs. We also describe a technique for mining these regions efficiently by employing a pruning property, and preserve the spatial semantics of the data by dealing with different sized regions.

• We propose a novel and efficient algorithm for mining the STARs by devising a pruning property based on the high traffic regions. This allows the algorithm to prune as much of the search space as possible (for a given dataset) before doing the computationally expensive part. If the set of regions is R, we are able to significantly prune R (to A ⊂ R and C ⊂ R), resulting in a running time of O(|R|) + O(|A′ × C′|) instead of O(|R|²), where A′ = A − A ∩ C and C′ = C − A ∩ C. Our experiments show that this is a significant saving. Theoretically, it is also the most pruning possible without missing rules.

Our algorithms do not assume or rely on any form of index (such as an R-tree, or aggregated R-tree) to function or to obtain time savings. If such an index is available, the algorithms will perform even better. Our time savings come about due to a set of pruning properties, which are spatial in nature, based on the observation that only those patterns that have a support and confidence above a threshold are interesting to a user (in the sense that they model the data).

The rest of the paper is organized as follows. In Section 3 we place our contributions in context by surveying related work. In Section 4 we give several related definitions of STARs that highlight some of the differences in interpreting STARs. We close the section with a detailed example to illustrate the subtleties. In Section 5 we define hot spots, stationary regions and high traffic regions. In Section 6 we propose an efficient algorithm for mining STARs which exploits several properties of "high traffic" regions. The results of our experiments on STAR mining are described in Section 7. We conclude in Section 8 with a summary and directions for future work. The appendix contains proofs of the theorems we exploit.


3 Related Work

There has been work on spatial association rules (examples include [8, 3]) and temporal association rules (examples include [2, 5]), but very little work has addressed both spatial and temporal dimensions. Most of the work that does can be categorised as traditional association rule mining [1] or sequential ARM applied to a spatio-temporal problem, such as in [7].

The work by Tao et al. [9] is the only research found that addressed the problem of STARs (albeit briefly) in the spatio-temporal domain. As an application of their work they show a brute force algorithm for mining specific spatio-temporal association rules. They consider association rules of the form (ri, τ, p) ⇒ rj, with the following interpretation: "If an object is in region ri at some time t, then with probability p it will appear in region rj by time t + τ". Such a rule is aggregated over all t in the following way: if the probability of the rule occurring at any fixed t is above p, a counter is incremented. If the fraction of such occurrences is over another threshold c, the rule is considered important and output. The authors call p the appearance probability, and c the confidence factor. They do not discuss the reasons for or the consequences of this choice. The confidence factor is really the support with respect to time of the rule, and when interpreted in the traditional sense, p is really the confidence threshold of the rule. There is also no support defined. That is, the number of objects for which the rule applies is ignored. For each time-stamp, their algorithm examines each pair of regions in turn, and counts the number of objects that move between the regions. It is a brute force technique that takes time quadratic in the number of regions. They use sketches (FM-PCSA) for speed, have a very simple STAR definition and ignore the spatial and temporal semantics of the data (such as the area of the regions or the time interval width).

Other interesting work that deals with spatio-temporal patterns in the spatio-temporal domain includes [11, 12, 4, 6, 9]. [6] mine periodic patterns in objects moving between regions. Wang et al. [12] introduce what they call flow patterns, which describe the changes of events over space and time. They consider events occurring in regions, and how these events are connected with changes in neighbouring regions as time progresses. So rather than mining a sequence of events in time, they mine a sequence of events that occur in specific regions over time and include a neighbourhood relation.

Ishikawa et al. [4] describe a technique for mining object mobility patterns in the form of Markov transition probabilities from an indexed spatio-temporal dataset of moving points. In this case, the transition probability pij of an (order 1) Markov chain is P(rj | ri), where ri and rj are regions, which is the confidence of a spatio-temporal association rule, although this is not mentioned by the authors. [11] mine frequent sequences of non spatio-temporal values for regions.

The work we have listed above is quite different from ours. [9] considers a simple spatio-temporal association rule definition, and the algorithm for finding the rules is brute force. [4, 12] consider patterns that can be interpreted as STARs, but they focus on very different research problems. The algorithm of [4] will find all transition probabilities, even if they are small. Amongst other things, our algorithm makes use of the fact that users will not be interested in rules below confidence and support thresholds, and uses this to prune the search space. And most importantly, none of the related work consider the spatial semantics of the regions, such as area, nor do they consider spatial support or similar concepts.

Dealing with the area of regions correctly is important for interpretation of the results. Many authors implicitly assume that the spatial regions can be specified to suit the algorithm. However, this is usually not the case. Cells in a mobile phone network are fixed, and have a wide range of sizes and geometries depending on geographic and population factors. Data mining applications have to be developed to work with the given region set, and we cannot ignore the influence of different sized regions. In the case of mining mobility patterns of moving objects (including sources, sinks, stationary regions, thoroughfares and STARs), ignoring area will skew the results in favour of larger regions because they will have more objects in them on average. By taking the region sizes into account, we avoid skewing the results and make our STARs comparable across different sized regions. Finally, although it is possible to scale most patterns by the area after they have been mined, this is undesirable because it prevents pruning of the search space. Our algorithms deal with the spatio-temporal semantics such as area effectively throughout the mining process and prune the search space as much as possible.

No previous work could be found, despite our efforts, that considers sources, sinks, stationary regions and thoroughfares. We think these patterns are very important because they capture temporal aspects of the way that objects move in space.

4 Spatio-Temporal Association Rules

Given a dataset T of spatio-temporal data, define a language L that is able to express properties or groupings of the data (in time, space, and object attributes). Given two sentences ϕ1 ∈ L and ϕ2 ∈ L that have no common terms, define a spatio-temporal association rule as ϕ1 ⇒ ϕ2. For example, the rule "late shift workers head into the city in the evening" can be expressed as LateShiftWorker(x) ∧ InRegion(OutsideCity) ∧ Time(Evening) ⇒ InRegion(City) ∧ Time(Night). To evaluate whether such a spatio-temporal rule is interesting in T, a selection predicate q(T, ϕ1 ⇒ ϕ2) maps the rule to {true, false}. The selection predicate will in general be a combination of support and confidence measures. For example, if the support and confidence of a rule R1 are above their respective thresholds, then q(T, R1) evaluates to true.

The language L can be arbitrarily complex. We consider the special case where objects satisfying a query move between spatial regions. A query q allows the expression of predicates on the set of non spatio-temporal attributes of the objects. We explore a number of definitions of such STARs in this section to highlight temporal subtleties. We deal only with the STAR of Definition 4.4 outside this section, so the reader can safely focus on this on the first reading, without missing the main ideas of the paper.

Definition 4.1 STAR: Objects in region ri satisfying q at time t will appear in region rj for the first time at time t + τ. Notation: (ri, t, @τ, q) ⇒ rj.

Note that this rule distinctly defines the time in rj at which the objects must arrive.

Definition 4.2 STAR: Objects in region ri satisfying q at time t will be in region rj at time t + τ. Notation: (ri, t, τ, q) ⇒ rj.

The above definition is less rigid and allows objects that arrived earlier than time t + τ to be counted as long as they are still present at time t + τ. The next definition counts the objects as long as they have made an appearance in rj at any time within [t, t + τ].

Definition 4.3 STAR: Objects in region ri satisfying q at time t will appear in region rj by (that is, at or before) time t + τ. Notation: (ri, [t, τ], q) ⇒ rj.

This is generalised in our final definition:

Definition 4.4 STAR: Objects appearing in region ri satisfying q during time interval TIs will appear in region rj during time interval TIe, where TIs ∩ TIe = ∅ and TIs is immediately before TIe (see footnote 1). Notation: (ri, TIs, q) ⇒ (rj, TIe).

From these we are interested in the rules that have a high confidence and high support. We will use the notation ri ⇒ rj or ζ for a STAR when we are not concerned with its exact definition.

Footnote 1: That is, there does not exist a time-stamp t that is between the time intervals in the sense that (t < te ∀ te ∈ TIe) ∧ (t > ts ∀ ts ∈ TIs).

Figure 4.1. Example data for spatio-temporal association rule mining. See example 4.7.

Definition 4.5 Define the support of a rule ζ, denoted by σ(ζ), to be the number of objects that follow the rule, and the support (with respect to q) of a region r during TI, denoted by σ(r, TI, q), to be the number of distinct objects within r during TI satisfying q.

We will consider the problem of more rigorous support definitions that are more appropriate in a spatio-temporal setting later.

Definition 4.6 Define the confidence of a rule ζ whose antecedent contains region ri during TI, denoted by c(ζ), as the conditional probability that the consequent is true given that the antecedent is true. This is the probability that the rule holds; it is analogous to the traditional definition of confidence and is given by c(ζ) = σ(ζ)/σ(ri, TI).

Note that all the definitions are equivalent when TIs = t, TIe = t + 1 and τ = 1. We illustrate the above definitions with an example.

Example 4.7 Consider Figure 4.1, which shows the movement of the set of objects S = {a, b, c, d, e, f, g} in the timeframe [t, t + 3] captured at the four snapshots t, t + 1, t + 2, t + 3. Assume that q = 'true' so that all objects satisfy the query. We give examples for each STAR definition in turn.

Consider the STAR ζ = (r1, t, @1, q) ⇒ r2. From the diagram, {b, c, e} follow this rule, so the support of the rule is σ(ζ) = 3. Since the total number of objects that started in r1 is 5 = σ(r1, t) = |{a, b, c, d, e}|, the confidence of the rule is c(ζ) = 3/5. For ζ = (r1, t, @2, q) ⇒ r2 we have σ(ζ) = 2 because {a, d} follow the rule, and c(ζ) = 2/5. For ζ = (r1, t, @3, q) ⇒ r2 we have σ(ζ) = 0 because no object appears in r2 for the first time at time t + 3.

The STAR ζ = (r1, t, 1, q) ⇒ r2 is equivalent to ζ = (r1, t, @1, q) ⇒ r2. But for ζ = (r1, t, 2, q) ⇒ r2 we have σ(ζ) = 4 because {a, b, c, d} follow the rule (for this STAR definition we count them as long as they are still there at time t + 2), and c(ζ) = 4/5. For ζ = (r1, t, 3, q) ⇒ r2 we have σ(ζ) = 4 since {a, b, d, e} follow the rule (we don't count c because it is no longer in r2 at time t + 3), and c(ζ) = 4/5.

(r1, [t, 1], q) ⇒ r2 = (r1, t, 1, q) ⇒ r2 = (r1, t, @1, q) ⇒ r2. For ζ = (r1, [t, 2], q) ⇒ r2 we have σ(ζ) = 5 because {a, b, c, d, e} satisfy the rule; e satisfies even though it has left by t + 2. Since all objects from r1 have made an appearance in r2 by t + 2, we must have σ((r1, [t, k], q) ⇒ r2) = 5 for all k ≥ 2. For ζ = (r1, [t + 1, 1], q) ⇒ r2 we have σ(ζ) = 2 and c(ζ) = 2/2 = 1.

The STAR ζ = (r1, [t, t], q) ⇒ (r2, [t + 1, t + k]) is equivalent to (r1, [t, k], q) ⇒ r2 for k ≥ 1. For the STAR ζ = (r1, [t, t + 1], q) ⇒ (r2, [t + 2, t + 3]) we have 5 distinct objects ({a, b, c, d, e}) appearing in r1 during [t, t + 1] and 6 distinct objects ({a, b, c, d, e, g}) appearing in r2 during [t + 2, t + 3]. The objects following the rule are {a, b, c, d, e}, so the support of the rule is 5 and its confidence is 5/5 = 1. For ζ = (r1, [t + 1, t + 2], q) ⇒ (r2, [t + 3]) we have σ(ζ) = 3 and c(ζ) = 3/4.

Counting the objects that move between regions is a simple task. The main idea is that if S1 is the set of objects in r1 at time t and S2 is the set of objects in r2 at time t + 1, then the number of objects moving from r1 to r2 during that time is |S1 ∩ S2| (assuming objects don't appear in more than one region at a time).
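If each snapshot is represented as a mapping from region identifiers to the set of object identifiers it contains (an assumed data structure, chosen only for illustration), this counting and the resulting support and confidence reduce to a few set operations:

```python
def star_support_and_confidence(snapshot_t, snapshot_t1, r1, r2):
    """Support and confidence of the STAR r1 => r2 between two consecutive snapshots.
    snapshot_t and snapshot_t1 map region ids to sets of object ids."""
    s1 = snapshot_t.get(r1, set())
    s2 = snapshot_t1.get(r2, set())
    support = len(s1 & s2)                        # objects that moved from r1 to r2
    confidence = support / len(s1) if s1 else 0.0
    return support, confidence

# In the spirit of Figure 4.1: five objects in r1 at t, three of them in r2 at t+1
snap_t  = {"r1": {"a", "b", "c", "d", "e"}}
snap_t1 = {"r2": {"b", "c", "e", "g"}}
print(star_support_and_confidence(snap_t, snap_t1, "r1", "r2"))  # (3, 0.6)
```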

4.1 Extending Support into the Spatio-Temporal Setting

Defining support in a spatio-temporal setting is more complicated than we have considered so far. Specifically, the size of any spatial atom or term in the rule should affect the support. That is, given the support in Definition 4.5, two rules whose only difference is the area of the region in which they apply will have identical support. Consider Figure 4.2, where r1 ⊂ R1, and objects a, b, c, d move from r1 to r2 between time t and t + 1. Then the rules r1 ⇒ r2 and R1 ⇒ r2 have the same support (see footnote 2). However, among these sets of equivalent rules we would prefer the rule covering the smallest area because it is more precise. A similar issue arises when we wish to compare the support of rules that cover different sized regions. Consider again Figure 4.2. The support of r1 ⇒ r2 is 4 and σ(r3 ⇒ r4) = 2, while σ(R1 ⇒ R2) = 6, which is higher than the other rules but only because it has a greater coverage. This leads to the conclusion that support should be defined in terms of the coverage of a rule.

Figure 4.2. Example data showing objects moving from time t to t + 1.

Definition 4.8 The spatial coverage of a spatio-temporal association rule ζ, denoted by φs(ζ), is the sum of the area referenced in the antecedent and consequent of that rule. Trivially, the spatial coverage of a region ri is defined as φs(ri).

For example, the coverage of the rule (r1, t, τ, q) ⇒ r2 is area(r1) + area(r2). This remains true even if r1 = r2, so that STARs with this property are not artificially advantaged over the others.

We solve the problem of different sized regions by scaling the support σ(ζ) of a rule by the area that it covers, to give spatial support.

Definition 4.9 The spatial support, denoted by σs(ζ), is the spatial coverage scaled support of the rule. That is, σs(ζ) = σ(ζ)/φs(ζ). The spatial support of a region ri during TI and with respect to q is σs(ri, TI, q) = σ(ri, TI, q)/φs(ri).

Consider again Figure 4.2 and assume the ri have unit area and the Ri are completely composed of the ri they cover. Then we have σs(r1 ⇒ r2) = σ(r1 ⇒ r2)/φs(r1 ∪ r2) = 4/2 = 2, σs(r3 ⇒ r4) = 2/2 = 1 and σs(R1 ⇒ R2) = σ(R1 ⇒ R2)/φs(R1 ∪ R2) = 6/4 = 3/2. The rule R1 ⇒ R2 no longer has an advantage, and in fact its spatial support is the weighted average of its two composing rules.

Footnote 2: Since a, b, c, d follow the rules, the support is 4 in both cases.
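A small sketch of the spatial support computation of Definition 4.9, reproducing the Figure 4.2 numbers; the hard-coded areas and rule supports are taken from the worked example, and the function names are illustrative only:

```python
def spatial_coverage(areas, antecedent, consequent):
    """phi_s of a rule: sum of the areas referenced in the antecedent and consequent."""
    return areas[antecedent] + areas[consequent]

def spatial_support(support, areas, antecedent, consequent):
    """sigma_s = sigma / phi_s (Definition 4.9)."""
    return support / spatial_coverage(areas, antecedent, consequent)

areas = {"r1": 1.0, "r2": 1.0, "r3": 1.0, "r4": 1.0, "R1": 2.0, "R2": 2.0}
print(spatial_support(4, areas, "r1", "r2"))  # 2.0
print(spatial_support(2, areas, "r3", "r4"))  # 1.0
print(spatial_support(6, areas, "R1", "R2"))  # 1.5, i.e. 3/2
```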

We do not need to scale confidence because it does not suffer from these problems. Indeed, increasing the size of the regions in a rule will on average increase both σ(ζ) and σ(ri, TIs), so larger regions are not advantaged. Confidence is also a (conditional) probability, so scaling it by spatial coverage would remove this property.

In a spatio-temporal database we must also consider the temporal support and temporal coverage.

Definition 4.10 The temporal coverage of a rule ζ, denoted by φt(ζ), is the total length of the time intervals in the rule definition.

For example, the temporal coverage of the rule (ri, TIs, q) ⇒ (rj, TIe) is |TIs| + |TIe|, where |TI| is an appropriate measure of the time interval width.

Definition 4.11 The temporal support of a rule ζ, denoted by σt(ζ), is the number of time interval pairs TI′ = [TIs, TIe] over which it holds.

Note that we did not perform scaling by temporal coverage. In short, this is because we view the temporal coverage as being defined by the user, and so each rule mined will necessarily have the same temporal coverage. A more complicated reason presents itself when we consider mining the actual rules. For example, assume the temporal coverage of a rule ζ is τ. We have at least two options: either we count the temporal support of the rule during [t, t + τ], [t + 1, t + 1 + τ], [t + 2, t + 2 + τ], ... or during [t, t + τ], [t + τ, t + 2τ], [t + 2τ, t + 3τ], .... Scaling by temporal coverage would only make sense in the second case. If we assume an open timescale (one that has no end, or is sufficiently large that we can assume this), then the number of opportunities to gain a support count (that is, for the rule to hold) in the first case does not depend on the size of τ. That is, the temporal coverage is not a factor.

Note that temporal support only applies to the case where a user is interested in which STARs re-occur over time (and hence that STARs which rarely occur are not interesting). The reader should note that the definitions of STARs that we give apply to a specific time interval and describe how objects move during that time (indeed, our algorithms look at each pair of time intervals TI′ only once). This is quite general in the sense that the mined STARs can be analysed for changes over time, for periodicities, or simply aggregated in time to find recurrent patterns.

Both temporal and spatial coverage are defined by the user (or by the application). Spatial coverage is inherent in the size of the regions. Temporal coverage is more flexible and determines the window for which rules must be valid, but this choice is the same for all rules. When mining STARs we attempt to find rules that have a spatial support above a threshold, minSpatSup, and a confidence above minConf. If the user is interested in summarising STARs over time, we additionally output only those rules with temporal support above minTempSup.

Figure 5.1. Illustration of the technique for finding high traffic areas.

5 Hot-Spots, High Traffic Areas and Stationary Regions

Definition 5.1 A region r is a dense region or hot spot with respect to q during TI if density(r, TI) ≡ σs(r, TI, q) ≥ minDensity.

We define a region r to have high traffic (with respect to some query q) if the number of objects that satisfy q and are entering and/or leaving the region is above some threshold. A stationary region is one where enough objects remain in the region. These patterns are a special type of STAR.

They are also easy to find. Consider two successive time intervals TI1 and TI2. Then the number of objects (satisfying q) that are in r in TI2 that were not there in TI1 is the number of objects that entered r between TI1 and TI2. Let S1 be the set of objects (satisfying q) that are in r during TI1, and let S2 be the corresponding set for TI2. As demonstrated in Figure 5.1, we are clearly looking for |S2 − S1|, where − is the set difference operation. Similarly, the number of objects leaving r is |S1 − S2|, and the number of objects that remain in r for both TI1 and TI2 is |S1 ∩ S2|.
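These set operations translate directly into code. The sketch below classifies a single region from its object sets in two successive time intervals, scaling counts by region area in the spirit of Definitions 5.3 and 5.4; the function and its toy data are illustrative assumptions, not the paper's implementation.

```python
def classify_region(s1, s2, area, min_traffic):
    """Classify a region given its object sets s1 (during TI1) and s2 (during TI2)."""
    entering = len(s2 - s1) / area     # objects that entered between TI1 and TI2
    leaving  = len(s1 - s2) / area     # objects that left
    staying  = len(s1 & s2) / area     # objects that remained
    labels = set()
    if entering >= min_traffic:
        labels.add("sink")
    if leaving >= min_traffic:
        labels.add("source")
    if {"sink", "source"} <= labels:
        labels.add("thoroughfare")
    if staying >= min_traffic:
        labels.add("stationary")
    return labels

# Toy illustration (unit area, threshold 3): three objects enter, so the region is a sink
print(classify_region({"x", "y"}, {"x", "a", "b", "c"}, 1.0, 3))  # {'sink'}
```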

Example 5.2 Consider again Figure 4.1 during TI′ = [[t, t], [t + 1, t + 1]] ≡ [t, t + 1] and assume the threshold is 3. {e, b, c} enter r2 during TI′, so r2 is a sink, and because they came from r1, r1 is a source. During TI′ = [t + 1, t + 2], {g, b, c} remain in r2, so it is a stationary region during TI′. If the threshold is 2, it would also be a thoroughfare because {a, d} enter while {e, f} leave during TI′. r2 is also a stationary region during [t + 2, t + 3] because {a, b, d} stay there.


To express high traffic regions as STARs, note that if we let ∗ be the set of regions excluding r but including a special region relse, then the number of objects entering r during TI′ = [TIi, TIi+1] is just the support of (∗, TIi, q) ⇒ (r, TIi+1). We need relse to cater for the case that an object 'just' appears in (disappears from) the region set. We model this as the object coming from (going to) relse. This situation would happen in the mobile phone example when a user turns his/her phone on (off). We now formally define high traffic areas and stationary regions.

Definition 5.3 A region r is a high traffic region with respect to query q if the number of objects (satisfying q) entering r (ne) or leaving r (nl) during TI′ = [TIi, TIi+1] satisfies

α / φs(r) ≥ minTraffic, where α = ne or nl,

where minTraffic is a given density threshold and φs is given by Definition 4.8. Note that ne ≡ σ((∗, TIi, q) ⇒ (r, TIi+1)) and nl ≡ σ((r, TIi, q) ⇒ (∗, TIi+1)).

Such regions can be further subdivided. If ne/φs(r) is above minTraffic we call that region a sink. If nl/φs(r) is above minTraffic we call it a source, and if a region is classified as both a sink and a source we call it a thoroughfare.

Definition 5.4 If the number of objects remaining in r, denoted by ns, satisfies ns/φs(r) ≡ σ((r, TIi, q) ⇒ (r, TIi+1))/φs(r) ≥ minTraffic, then we call r a stationary region. A stationary region may or may not be a high traffic region.

Note that if we define area(∗) = 0, then the definition of high traffic areas is a statement about the spatial support of special types of STARs. For stationary regions, however, we get as a consequence of Definition 4.8 that ns/φs(r) = 2 · σs((r, TIi, q) ⇒ (r, TIi+1)). We define nα/φs(r) : α ∈ {e, l, s} as the spatial support of these patterns.

The following theorem allows us to prune the search space for finding high traffic regions and stationary regions.

Theorem 5.5 If minTraffic ≥ minDensity then:
1) the set of sources during [TIi, TIi+1] is a subset of the dense regions during TIi;
2) the set of sinks during [TIi, TIi+1] is a subset of the dense regions during TIi+1; and
3) the set of stationary regions during [TIi, TIi+1] is a subset of the regions that are dense both during TIi and during TIi+1.

Proof: See appendix.


Figure 6.1. Illustration of the complete mining procedure.

As a consequence, if minTraffic ≥ minDensity then the set of thoroughfares during [TIi, TIi+1] is a subset of the regions that are dense at time TIi and at time TIi+1.

These properties prune the search for high traffic regions, so we can find high traffic regions by setting minDensity = minTraffic, mining all hot-spots and then mining the hot-spots for high traffic areas.

6 Mining Spatio-Temporal Association Rules

In this section we exploit the high traffic area mining techniques to develop an efficient STAR mining algorithm for the STARs of Definition 4.4. These STARs are parameterised by a query q, which we will omit from the discussion for simplicity.

As motivation for why this is useful, assume a set of regions R for which we have data for k time instances. Then there are |R|²(k − 1) possible rules, since there are |R| starting regions and |R| finishing regions (staying put is possible too), and for each such pair there are up to k − 1 time-frames to examine. Using a brute force method, this would require |R|²(k − 1) counts of the number of objects in a region. Our algorithms address the quadratic time component.
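As a purely illustrative calculation (our numbers, borrowing the 15 × 15 grid and 101 timestamps of Section 7): with |R| = 225 and k = 101 this already amounts to 225² · 100 = 5,062,500 candidate rules, each requiring object counts, which is what makes pruning the quadratic component worthwhile.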

The reader may find it useful to refer to Figure 6.1 while reading the following.

The steps to mine STARs of Definition 4.4 with spatial support above minSpatSup and confidence above minConf during the time interval TI′i = [TIi, TIi+1] are as follows:

1. Let sizeFactor = (maxk(area(rk)) + mink(area(rk))) / maxk(area(rk)).

2. Set minDensity = minSpatSup · sizeFactor and mine all hot-spots during TIi and during TIi+1 to produce the sets Hi and Hi+1.



3. Set minTraffic = minSpatSup · sizeFactor and find the set of high traffic areas and stationary regions from Hi and Hi+1. Denote the set of sources by A, the set of sinks by C, the set of thoroughfares by T and the set of stationary regions by S. Recall from Theorem 5.5 that A ⊂ Hi, C ⊂ Hi+1, S ⊂ Hi ∩ Hi+1 and T = A ∩ C ⊂ Hi ∩ Hi+1.

4. We will show that A contains all candidates for the antecedent of STARs, C contains all candidates for consequents of STARs, and S contains all the STARs where the antecedent is the same as the consequent. Using this, we evaluate the rules corresponding to the elements of A × C − S × S and S for spatial support and confidence³. We keep all rules that pass these tests.
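To make the four steps concrete, the following is a minimal end-to-end sketch in Python of one pass over TI′ (our own illustration, not the authors' implementation). It assumes φs(r) = area(r), which is consistent with the note after Definition 5.4, takes two position snapshots as input, and ignores the special region relse; all names are ours.

from collections import Counter

def mine_stars_interval(pos_i, pos_i1, areas, min_spat_sup):
    """One pass of the STAR mining procedure for TI' = [TI_i, TI_{i+1}].
    pos_i, pos_i1 : dict object_id -> region_id at TI_i and TI_{i+1}
    areas         : dict region_id -> area
    Returns a list of (antecedent, consequent, spatial_support) tuples."""
    # Step 1: sizeFactor from the largest and smallest region areas.
    size_factor = (max(areas.values()) + min(areas.values())) / max(areas.values())
    min_density = min_traffic = min_spat_sup * size_factor

    # Step 2: hot-spots during TI_i and TI_{i+1} (brute-force density counts).
    dens_i, dens_i1 = Counter(pos_i.values()), Counter(pos_i1.values())
    H_i = {r for r in areas if dens_i[r] / areas[r] >= min_density}
    H_i1 = {r for r in areas if dens_i1[r] / areas[r] >= min_density}

    # Step 3: sources A, sinks C and stationary regions S, restricted to the
    # hot-spots as allowed by Theorem 5.5.
    moved = Counter((pos_i[o], pos_i1[o]) for o in pos_i if o in pos_i1)
    leave, enter, stay = Counter(), Counter(), Counter()
    for (ri, rj), n in moved.items():
        if ri == rj:
            stay[ri] += n
        else:
            leave[ri] += n
            enter[rj] += n
    A = {r for r in H_i if leave[r] / areas[r] >= min_traffic}
    C = {r for r in H_i1 if enter[r] / areas[r] >= min_traffic}
    S = {r for r in H_i & H_i1 if stay[r] / areas[r] >= min_traffic}

    # Step 4: evaluate only A x C - S x S, plus the equal-endpoint rules from S.
    stars = []
    candidates = [(a, c) for a in A for c in C if not (a in S and c in S)]
    candidates += [(s, s) for s in S]
    for ri, rj in candidates:
        sup = moved[(ri, rj)] / (areas[ri] + areas[rj])
        if sup >= min_spat_sup:          # a confidence test would go here as well
            stars.append((ri, rj, sup))
    return stars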

We then apply the above procedure for the next successive pair of timestamps TI′i+1 = [TIi+1, TIi+2], and so on. We therefore generate a sequence of pattern sets (hot-spots, high traffic areas, stationary regions and STARs) < ξi, ξi+1, ξi+2, ... > over time. If desired, we aggregate the patterns by counting the number of intervals TI′ for which each of the patterns holds. If the total number of these (its temporal support as defined earlier) is above the threshold minTempSup after the procedure is complete, we output the pattern.

The TI are given by a schedule algorithm that splits up the timestamps into a stream of time intervals. There are many possible choices for this algorithm, two examples of which we considered in Section 4.1.

An optional pruning method may be applied that takes into account an object's maximum speed or other restrictions on its movement. That is, a STAR will not exist between two regions ri and rj if they are so far apart that it is impossible to reach rj from ri during the time interval TI′. Define a neighbourhood relation N(Ri, Rj) that outputs the subset S of Ri × Rj such that S = {(ri, rj) : N(ri, rj) = 1, ri ∈ Ri, rj ∈ Rj}. Here, Ri, Rj are sets of regions, and N(ri, rj) is 1 if and only if ri and rj are neighbours. By 'neighbours' we mean that rj can be reached from ri during TI′. This relation allows restrictions such as 'one way' areas, inaccessible areas, and maximum speed of objects to be exploited for further pruning of the search space. If such a relation is available, we need only evaluate N(A, C) − S × S.

The reader should note that |A × C − S × S| ≤ |R × R|, and that the amount by which it is smaller depends on the data and spatial support settings. We effectively prune the search space as much as possible given the dataset and mining parameters before doing the computationally expensive part. Our experiments show that this time saving is large in practice, even for very small spatial support thresholds.

³Note that S may or may not be contained in A ∪ C and may in fact be disjoint. This is why we need to evaluate all of S for STARs. Since some overlap may occur, we save repeated work by evaluating A × C − S × S rather than A × C.

The reader will note that the stationary regions, high traffic areas and STARs have spatial and temporal support defined for them, and apply over two successive time intervals TI′ = [TIi, TIi+1]. Hot spots, however, are defined over a single time interval. It is useful to have a definition of hot spots that applies for TI′, to unify the patterns and to allow a universal definition of temporal support. We have a number of choices, but we feel the following is the most logical:

Definition 6.1 Define a hot-spot or dense region during TI′ = [TIi, TIi+1] as a region that is dense during both TIi and TIi+1.

That is, the set of dense regions during TI′ = [TIi, TIi+1] is Hi ∩ Hi+1 in our procedure above. Defining temporal support is trivial: it is just the number of TI′s for which the region is dense.

6.1 Justification of the STAR Mining Algorithm

We now present the theorem that underpins our STAR mining algorithm:

Theorem 6.2 If sizeFactor · minSpatSup ≥ minTraffic then during TI′ = [TIi, TIi+1]:

1). The set of consequent regions of STARs with spatial support above minSpatSup is a subset of the set of sinks, C.

2). The set of antecedent regions of STARs with spatial support above minSpatSup is a subset of the set of sources, A, and

3). The set of STARs whose consequent and antecedent are the same and have spatial support above minSpatSup corresponds to a subset of the stationary regions, with equality when 2 · minSpatSup = minTraffic.

Proof: See appendix.

If regions are of different sizes, then the pruning will be least efficient in the worst-case situation where both a very large region and a very small region exist. In the limiting case we obtain the lower bound of sizeFactor over all choices of region geometry, which is 1. On the other hand, the best pruning occurs when all the regions are the same size, in which case sizeFactor = 2 and the set of stationary regions corresponds exactly to the set of STARs with the same antecedents and consequents.

So when all regions are the same size, we set 2 · minSpatSup = minTraffic = minDensity in the procedure above, and we do not need to check the rules corresponding to S for support.



7 Experiments

In this section we present the results of our STAR mining algorithm.

7.1 The Datasets

We used a modified⁴ version of the well-known GSTD-TOOL to generate the data for our experiments. Theodoridis et al. [10] proposed the GSTD ("Generate Spatio-Temporal Data") algorithm for building sets of moving points or rectangular objects. For each object o, the GSTD algorithm generates tuples of the form (id, t, x, y, ...) where id is a unique id, t is a time-stamp with t ∈ [0, 1] and (x, y) are the coordinates of the point in [0, 1]². Object movements are configurable by specifying the distribution of changes in location.

We generated four datasets for our experiments, each consisting of 10,000 points. We generated 101 instances for each of these points, corresponding to the timestamps 0, 0.01, 0.02, ..., 1, which gave us 1,010,000 instances per dataset. That is, the points changed their location every 0.01 time units, which can be thought of as sampling the location of continuously moving objects every 0.01 time units. We use the following partitioning of timestamps to generate our intervals: TI′ = [t, t+1], [t+1, t+2], [t+2, t+3], .... That is, the time intervals discussed in our algorithms become individual timestamps, and each successive pair of timestamps is used for STAR mining. The distributions used for object movement were X ∼ uniform(−0.01, 0.05) and Y ∼ uniform(−0.01, 0.1). This means that object updates follow the rules x = x + X, y = y + Y, and on average an object moves 0.02 units to the right and 0.045 units up the unit square. Toroidal adjustment was used, so objects wrap around the unit square when they reach its boundary. The initial distributions were Gaussian with mean 0.5 in both the x and y directions. Our four datasets differed only in the variance of the initial distributions, as shown in Figure 7.1.
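For concreteness, a minimal sketch of this generation scheme (not the GSTD tool itself); the parameter values mirror those just described, and the function name is ours.

import numpy as np

def generate_dataset(n_points=10_000, n_steps=101, sigma2=0.05, seed=0):
    """Gaussian initial positions, uniform displacements, toroidal wrap-around."""
    rng = np.random.default_rng(seed)
    pos = rng.normal(0.5, np.sqrt(sigma2), size=(n_points, 2)) % 1.0
    records = []
    for step in range(n_steps):
        t = round(step * 0.01, 2)
        records.extend((i, t, x, y) for i, (x, y) in enumerate(pos))
        dx = rng.uniform(-0.01, 0.05, size=n_points)   # X ~ uniform(-0.01, 0.05)
        dy = rng.uniform(-0.01, 0.10, size=n_points)   # Y ~ uniform(-0.01, 0.1)
        pos = (pos + np.column_stack([dx, dy])) % 1.0   # toroidal adjustment
    return records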

For the parameters listed above, the objects make about 4.5 loops of the workspace during the 100 timestamps in the dataset. This speed and the randomness of the motion also have the effect of spreading the objects out rather quickly over the workspace. Therefore, even the compact dataset becomes sparse toward the end of the 100 timestamps. The compact dataset is thus not as easy for our STAR mining algorithm as it might first appear. Indeed, the datasets were chosen to thoroughly exercise our algorithm.

7.1.1 The Regions

We used several different region configurations, all grids. The number of regions was varied, while keeping the total area covered constant at 15⁴ = 50625 (the output of the GSTD algorithm was scaled to this area). The numbers of regions we tested were 36, 81, 144, 225, 324 and 441, in an n × n grid configuration.

⁴The GSTD algorithm does not output data in time order. We modified it to do so because sorting the output was not feasible for large datasets.

Figure 7.1. Initial Distribution of the Four Datasets: (a) Compact Dataset (σ² = 0.05), (b) Medium Dataset (σ² = 0.1), (c) Sparse Dataset (σ² = 0.2), (d) Very Sparse Dataset (σ² = 0.05).
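As an illustration of how the scaled points can be bucketed into these grids before density counting (our own sketch; the side length 225 follows from the total area 15⁴ = 50625 and is an assumption about the scaling):

import numpy as np

def region_counts(points, n, side=225.0):
    """Count objects per cell of an n x n grid covering a side x side workspace."""
    cell = side / n
    ix = np.minimum((points[:, 0] // cell).astype(int), n - 1)
    iy = np.minimum((points[:, 1] // cell).astype(int), n - 1)
    counts = np.zeros((n, n), dtype=int)
    np.add.at(counts, (ix, iy), 1)
    return counts   # counts[i, j] / cell**2 is the density of region (i, j)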

7.2 Evaluating the STAR Mining Algorithm

We evaluate the performance gains of the algorithm over a brute force algorithm performing the same task on the various datasets, using different parameter settings.

Recall that the STAR mining algorithm first found dense regions, then used these to find sources, sinks, thoroughfares and stationary regions. It then used the sources as potential antecedents (A), the sinks as potential consequents (C) and evaluated all STARs corresponding to a subset of the cross product A × C for spatial support and confidence. The brute force mining algorithm simply evaluates all STARs corresponding to the cross product R × R, where R is the set of all regions.

We used a simple brute force technique to find the dense regions. We did not use a neighbourhood relation to further prune the search space in these experiments.

We varied the spatial support thresholds: minSpatSup ∈ {0.05, 0.075, 0.1}. Due to our region configuration (which gives sizeFactor = 2), and using the results from the theory, we always have minSpatSup · 2 = minDensity = minTraffic. We also used minConf = 0.0 (i.e., no confidence threshold) and minTempSup = 1.

The choice of a very low spatial support threshold (minSpatSup = 0.05) was made so that many rules were mined for all the datasets, and so that a high proportion of regions would be dense and high traffic regions according to the corresponding thresholds. For example, for the 15 by 15 region set, minSpatSup = 0.05 corresponds to a region being dense (high traffic) if at least 22.5 objects occupy it (move into it or out of it). Since there are 10,000 objects and only 15² = 225 regions, this means that if the points were spread out evenly (which is almost the case for the very sparse dataset, and for all datasets at the end of the 101 timestamps), each region would have more than 44 objects in it, more than sufficient for the support thresholds. And since objects will move on average more than 2/3 of the way across a region during each timestamp, there will be plenty of objects moving between regions to provide support for high traffic regions and STARs. With the finer region configurations (18 × 18, 21 × 21), objects move almost an entire region width each timestamp. If the algorithm performs well at a very low support setting, then it is guaranteed to perform even better for higher settings. The other support settings are still quite low but are large enough to demonstrate how big the benefits of pruning are. For the highest setting (minSpatSup = 0.1) in the 15 × 15 region configuration, the support threshold works out to be 45, barely greater than the average number of objects per region.

7.2.1 Results

As expected, the rules mined by our STAR mining algorithm, which is labeled in the figures as the 'pruning' technique, were identical to those mined by the brute force algorithm. The time taken to mine the rules using our algorithm was far superior to that of the brute force algorithm, as can be seen in Figure 7.2.

Recall that the benefit of our algorithm is to prune the search space of STARs in order to avoid the quadratic time explosion that occurs when the antecedent and consequent of a rule both vary over the set of all regions. In our algorithm, this quadratic time component occurs only for a subset of the total regions. Since the size of this subset depends on the support thresholds and the spread of the data (assuming, of course, that the points are sufficiently dense, which was deliberately the case for our datasets), the more spread out the data is, the more regions become potential sources and sinks (and hence potential antecedents and consequents) and the more regions must be examined for STARs.

Figure 7.2. Results on the different datasets with different support settings: (a) Compact Dataset, (b) Medium Dataset, (c) Sparse Dataset, (d) Very Sparse Dataset. Axes are time (ms) vs number of regions.



This effect is demonstrated in the results for the four datasets used. For the compact dataset case (Figure 7.2(a)), the time taken by the pruning algorithm grows significantly more slowly than that of the brute force approach, for all support thresholds.

For the very sparse dataset (Figure 7.2(d)), the times for both algorithms grow at a similar rate for minSpatSup = 0.05, but the pruning algorithm is still significantly faster. Recall that for this low setting of the support threshold, almost every region becomes dense and a high traffic region. For the higher support thresholds, the pruning algorithm is able to prune more regions and subsequently performs much better.

The other datasets fall between these two cases. In all cases it can be seen that pruning is very beneficial and that the amount by which the search space can be pruned is determined by the support thresholds. Since users will be looking for patterns with high support, this is ideal.

8 Conclusions

We have presented a rigorous definition of spatio-temporal association rules (STARs) while retaining the semantics of spatial and temporal data. Furthermore, we have defined temporal patterns in spatial regions: hot spots, stationary regions, high traffic areas, sources, sinks and thoroughfares. These can be used to prune the search space of STARs, but are also interesting in their own right. All of the patterns we have introduced describe temporal features of object mobility datasets. By mining the patterns on a time interval by time interval basis, we can not only find current patterns, but also see how these patterns evolve over longer periods of time. We also demonstrated how our techniques pruned the search space as much as possible before doing the computationally expensive part, leading to very efficient mining algorithms.

Appendix: Proofs

Proof of Theorem 5.5: Let n(TIi) be the number of objects in r during TIi, let nl(TIi) be the number of objects leaving r during TIi (compared with n(TIi−1)), and, similarly, let ne(TIi) be the number of objects entering r during TIi. The set of objects leaving a region r during TIi+1 is a subset of the objects within that region during TIi, so nl(TIi+1) ≤ n(TIi). Similarly, the set of objects entering r during TIi+1 is a subset of the objects in r during TIi+1, giving ne(TIi+1) ≤ n(TIi+1). For r to be classified as a source during TI′ = [TIi, TIi+1] we must have nl(TIi+1)/φs(r) ≥ minTraffic. So if minTraffic ≥ minDensity then density(r, TIi) = n(TIi)/φs(r) ≥ nl(TIi+1)/φs(r) ≥ minTraffic ≥ minDensity, so r must be dense during TIi. Similarly, for sinks we have density(r, TIi+1) = n(TIi+1)/φs(r) ≥ ne(TIi+1)/φs(r) ≥ minTraffic ≥ minDensity.

The set of objects remaining in a region during [TIi, TIi+1] is the intersection of the set of objects that are in the region during TIi (S1) and the set of objects that are in the region during TIi+1 (S2), and is thus clearly a subset of both S1 and S2. Therefore r can be a stationary region only if r is dense during both TIi and TIi+1 with respect to the threshold minTraffic.

Proof of Theorem 6.2: In the following, the time interval TI′ = [TIi, TIi+1] is implicit. Consider two regions ri and rj, where i and j range over all possible values. Let nl be the number of objects leaving ri during TI′, let ne be the number of objects entering rj during TI′, and let nm be the number of objects moving from ri to rj. Clearly nm ≤ ne and nm ≤ nl. For the rule ζ = (ri, TIi, q) ⇒ (rj, TIi+1) to have support of at least minSpatSup we have σs(ζ) = nm/(area(ri) + area(rj)) ≥ minSpatSup. We then have nl/(area(ri) + area(rj)) ≥ σs(ζ) and ne/(area(ri) + area(rj)) ≥ σs(ζ), so nl/area(ri) ≥ σs(ζ) · (area(ri) + area(rj))/area(ri) and ne/area(rj) ≥ σs(ζ) · (area(ri) + area(rj))/area(rj). Therefore, by Definition 5.3, ri is a source if σs(ζ) · (area(ri) + area(rj))/area(ri) ≥ minTraffic and rj is a sink if σs(ζ) · (area(ri) + area(rj))/area(rj) ≥ minTraffic, since in those cases we have nl/area(ri) ≥ σs(ζ) · (area(ri) + area(rj))/area(ri) ≥ minTraffic and ne/area(rj) ≥ σs(ζ) · (area(ri) + area(rj))/area(rj) ≥ minTraffic.

For both the source and the sink to be found we must thus have σs(ζ) · min{(area(ri) + area(rj))/area(ri), (area(ri) + area(rj))/area(rj)} ≥ minTraffic. Now, because we require the relationship to hold for all ri and rj, we must minimise min{(area(ri) + area(rj))/area(ri), (area(ri) + area(rj))/area(rj)} so that the bound is strict. Since (area(ri) + area(rj))/area(ri) is minimised for the choice of regions with area(ri) = maxk(area(rk)) and area(rj) = mink(area(rk)), the value we require is sizeFactor.

The last case is a consequence of the definition of the spatial support - that is, even if the antecedent and consequent are the same, they are both still counted in the scaling factor. In this case we have ri = rj. Let ns be the number of objects remaining in region r = ri = rj. By the definition of spatial support we have σs(ζ) = ns/(2 · area(r)), and for ζ to have sufficient spatial support we require σs(ζ) = ns/(2 · area(r)) ≥ minSpatSup. Now, r is a stationary region if ns/area(r) ≥ minTraffic, so all rules ζ where ri = rj will be found if 2 · minSpatSup ≥ minTraffic, with the set of such STARs being exactly equal to the set of stationary regions if 2 · minSpatSup = minTraffic. The theorem follows by noting that the maximum of sizeFactor over all choices of region geometry is 2.



References

[1] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 487–499. Morgan Kaufmann, 1994.

[2] J. M. Ale and G. H. Rossi. An approach to discovering temporal association rules. In SAC '00: Proceedings of the 2000 ACM Symposium on Applied Computing, pages 294–300. ACM Press, 2000.

[3] Y. Huang, H. Xiong, S. Shekhar, and J. Pei. Mining confident co-location rules without a support threshold. In Proceedings of the 18th ACM Symposium on Applied Computing (ACM SAC), 2003.

[4] Y. Ishikawa, Y. Tsukamoto, and H. Kitagawa. Extracting mobility statistics from indexed spatio-temporal datasets. In STDBM, pages 9–16, 2004.

[5] Y. Li, P. Ning, X. S. Wang, and S. Jajodia. Discovering calendar-based temporal association rules. Data Knowl. Eng., 44(2):193–218, 2003.

[6] N. Mamoulis, H. Cao, G. Kollios, M. Hadjieleftheriou, Y. Tao, and D. W. Cheung. Mining, indexing, and querying historical spatiotemporal data. In KDD '04: Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 236–245, New York, NY, USA, 2004. ACM Press.

[7] J. Mennis and J. Liu. Mining association rules in spatio-temporal data. In Proceedings of the 7th International Conference on GeoComputation, 2003.

[8] S. Shekhar and Y. Huang. Discovering spatial co-location patterns: a summary of results. In Proceedings of the 7th International Symposium on Spatial and Temporal Databases (SSTD01), 2001.

[9] Y. Tao, G. Kollios, J. Considine, F. Li, and D. Papadias. Spatio-temporal aggregation using sketches. In 20th International Conference on Data Engineering, pages 214–225. IEEE, 2004.

[10] Y. Theodoridis, J. Silva, and M. Nascimento. On the generation of spatio-temporal datasets. In 6th International Symposium on Large Spatial Databases (SSD'99), pages 147–164. Springer-Verlag, 1999.

[11] I. Tsoukatos and D. Gunopulos. Efficient mining of spatiotemporal patterns. In SSTD '01: Proceedings of the 7th International Symposium on Advances in Spatial and Temporal Databases, pages 425–442, London, UK, 2001. Springer-Verlag.

[12] J. Wang, W. Hsu, M.-L. Lee, and J. T.-L. Wang. FlowMiner: Finding flow patterns in spatio-temporal databases. In ICTAI, pages 14–21, 2004.



Tracking the Lyapunov Exponent in data streams

Raphael Ladysz
George Mason University
ISE Department, MSN 4A4
Fairfax, VA 22030
Email: [email protected]

Daniel Barbara
George Mason University
ISE Department, MSN 4A4
Fairfax, VA 22030
Email: [email protected]

Abstract—¹ Even though many physical phenomena exhibit non-linear behavior, time series data mining techniques have largely concentrated on analyzing linear processes. In this paper we address the issue of measuring and tracking chaotic behavior in a data stream. Chaotic behavior can be measured through the Lyapunov Exponent, which, loosely speaking, measures the rate of deviation between two neighbors in the series as time evolves. While algorithms to measure the Lyapunov Exponent exist in the literature, no technique had previously been developed to track it as data points arrive continuously at a sensor. We present in this paper a technique that can effectively track the exponent by taking advantage of the computations done prior to receiving a new batch of points. Measuring chaotic behavior has proven important in discovering changes in a system: for instance, it has been successfully utilized to predict the onset of epileptic attacks in patients. We demonstrate, through experimentation, that our technique is robust with respect to a large range of choices for the parameters used, while being able to quickly track drastic changes of chaotic behavior (from linear to non-linear phenomena and vice versa). We show results of applying our algorithm to a series of real data sets in a variety of applications such as financial, environmental, and medical.

I. INTRODUCTION

In spite of the fact that non-linear time series are prevalent in a broad spectrum of fields (e.g., [9]), not much effort has been devoted in the field of data mining to analyzing them. (Some exceptions to this can be found in [5].) Non-linear time series are characterized by chaos, which, loosely speaking, is a phenomenon found in signals that are "intermediate" between regular, periodic functions and unpredictable, truly stochastic behavior. Using classical techniques for signal analysis, such as the Fourier transform, non-linear signals appear very close to noise, or random behavior. In fact, to uncover the structure of chaos, one needs to resort to a reconstruction of the time series in a different space: the phase space. This transformation (which we will explain in a later section) takes into consideration how the time series behaves at regular intervals when we consider different starting points for the observations.

One of the features that characterizes chaos, and which can be measured in the phase space, is the Lyapunov Exponent (LE). (In truth, there is a spectrum of such exponents, but analysis usually focuses on the largest of them.) Roughly speaking, an LE measures the rate at which two points that were initially close in the signal (neighbors) diverge with time when mapped into the phase space. In other words, the LE is a measure of the sensitivity of the system to initial conditions and, as a result, of how far into the future one can predict its behavior. When LE > 0 (LE < 0), points that were initially close in the time series diverge (converge), and with LE = 0 they tend to stay close. To visualize this, consider a sinusoidal signal. Choosing two close points in the early stages of the signal, and tracking their behavior at regular intervals of time, one can easily see that the two points "stay" close as the signal progresses. For signals that exhibit chaos, this is not true: the two points will diverge at a rate measured by the LE. The Lyapunov exponent is thus a quick way of measuring the degree of chaos in a signal: a positive value indicates the presence of chaos, while a zero or negative one indicates more regular behavior.

¹This work has been sponsored by NSF grant IIS-0208519.

The Lyapunov exponent has successfully been used to predict the onset of seizures in epileptic patients (e.g., [9]). The methods compute the value of the Lyapunov exponent at various times in an Electro-Encephalogram (EEG) signal, monitoring its drop. The theory behind this application says that the EEG signals are highly chaotic in a brain that functions normally (since neurons fire in an asynchronous, non-periodic way). However, at the onset of a seizure, brain cells start firing synchronously, producing EEG signals that become periodic, making the chaos disappear. Hence, the Lyapunov exponent shows a rapid decrease in its value at this point. It is possible to detect this change several minutes before the actual seizure takes place, giving the patient time to prepare for it.

While several algorithms to compute the exponent can be found in the literature (e.g., [25], [10], [16]), there is no technique to track the Lyapunov exponent in real time, i.e., as the signal values are being received. All the experiments reported in the literature simply compute the LE over the whole set of points available at the time. Obviously, this naive approach breaks down when one is faced with an ever-increasing data set, as the performance would degrade with the number of points that need to be processed. The focus of this paper is to propose an efficient technique to track the LE, to analyze via experimentation the sensitivity of the method to the various parameters involved in its computation, and to demonstrate the usefulness of tracking Lyapunov exponents in real-life data.

II. BACKGROUND

A crucial idea in the analysis of chaos in time series is that the structure of the signal is not apparent in the one-dimensional observations x(t) that constitute the time series, but rather in a space of vectors of larger dimension. The transformation to this space is called phase space reconstruction; it takes every point x(t) in the time series and transforms it to a vector y(t), which has the form shown in Equation 1, where dim is the embedding dimension of the reconstructed phase space. Takens' theorem [20], which is beyond the scope of this paper, gives the mathematical background for the transformation, showing topological equivalence of the original and reconstructed phase space.

y(t) = [x(t), x(t+τ), x(t+2τ), · · · , x(t+(dim−1)∗τ)] (1)

The set of points in phase space visited by this transformation is called an attractor. An attractor can have an integer dimension (a regular attractor) or a fractional dimension (a strange attractor). A dynamical system is described by its trajectory, the set of points in phase space visited by the system. Only a small part of the whole phase space is occupied by the dynamical system. This small subset is the system's attractor, consisting of a set of points (in phase space). This set of points may be finite (in particular, just one element, called a fixed point) or infinite, resulting in a periodic or chaotic attractor.

Two parameters define the transformation. The first is the time delay, τ, and the second is the embedding dimension of the phase space, dim. The first question is then how to set these two parameters, as Takens' theorem does not show how to specify them.

Starting with dim, the basic idea is to choose its value as the minimum value for which there are no false neighbors. In short, given a value for dim, we call a pair of transformed vectors y(t1), y(t2) neighbors in the phase space if they are very close and correspond to the transformation of two close values in the time series, x(t1) and x(t2). If y(t1), y(t2) cease to be close when a dimension larger than dim is used for the transformation, the neighbors are called false. Practical methods to estimate a good value for dim compute the fraction of false neighbors among all the neighbors in phase space and increase the dimension until that percentage drops significantly (approaches zero).

For the time delay, the approach is to use mutual information in the original series. The aim is to minimize the average mutual information between measurements at time t and those at t + τ, so that the two sets of measurements become independent in a practical sense.

We have chosen the method of Buzug et al. ([3]) to compute these parameters. This method is based on geometrical considerations, it is relatively recent, and it has widespread acceptance. It computes the embedding dimension and time delay by considering the space filled by the corresponding attractor defined by the series. Essentially, the method works by defining a fill factor as the ratio of the average volume of parallelepipeds that closely track the shape of the attractor to the volume of the minimum hypercube that can completely contain the attractor. To find an appropriate τ, one aims to maximize the fill factor, given the embedding dimension. Once the value of τ has been selected, the embedding dimension is optimized by looking for the minimum dimension dE for which the fill factor, as a function of τ, does not change when the dimension is increased by one (to dE + 1). A way to check this is to compute the slopes of the two curves (for dE and dE + 1) at various points of τ and compare them. When the sets of slopes do not vary much, the dimension has been found. Complete details of the method are found in [3], and the software is available at [4].

As we stated in the introduction, for a dynamical system the sensitivity to initial conditions is quantified by the Lyapunov exponents. Two trajectories with very close initial conditions will diverge considerably if the attractor is chaotic, and the rate at which they diverge can be characterized by the largest Lyapunov exponent [10]. There exists a family of Lyapunov exponents whose nature can be understood by considering a small n-dimensional hyper-sphere of initial conditions (all very close to each other). As time progresses, the hyper-sphere evolves into a hyper-ellipsoid whose axes expand or contract at rates given by each of the Lyapunov exponents. If one of the exponents is positive (expansion), that is sufficient evidence of chaos.

Numerous methods to estimate the largest LE exist ([25], [22]). We have chosen to expand on the method presented by Rosenstein et al. [16] because of its qualities: it is fast, easy to implement, and robust to changes in parameters. It has been proven robust even in the presence of errors in the estimation of the embedding dimension or the reconstruction delay. It does not require, as other methods do, a large number of data samples. It is resistant to noise. And, most importantly, it can be adapted to provide tracking of the largest LE, which is the scope of this paper. In the next section, we describe the method used by Rosenstein et al. to estimate the largest LE for a given time series. Later we will show how we expanded on the method to provide tracking for a data stream.

III. ESTIMATING THE LYAPUNOV EXPONENT

Given the phase space representation of the time series (1), the attractor can be represented as a matrix X of the form (X1, X2, · · · , XM)ᵀ, where each Xj is a phase-space vector as shown in Equation 2.

Xj = [x(j), x(j + τ), x(j + 2τ), · · · , x(j + (dim − 1) ∗ τ)]   (2)

Therefore, X is an M × m matrix, with m = dim. To understand M, we need to realize that for an N-point time series it is only possible to define a limited number of phase-space vectors (each of them needs dim components), in such a way that Equation 3 holds.

M = N − (m − 1)τ (3)
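As an illustration of Equations 1-3 (our own sketch, with NumPy and the function name as our choices), the delay embedding can be built as:

import numpy as np

def delay_embed(x, dim, tau):
    """Build the M x dim attractor matrix of Eq. 2 from a 1-D series x,
    where M = N - (dim - 1) * tau as in Eq. 3."""
    x = np.asarray(x)
    M = len(x) - (dim - 1) * tau
    if M <= 0:
        raise ValueError("series too short for this (dim, tau)")
    return np.column_stack([x[k * tau : k * tau + M] for k in range(dim)])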

The algorithm locates the nearest neighbor of each row of this matrix. For this calculation, the Euclidean norm is used. Moreover, the algorithm imposes the restriction that the nearest neighbor of each row is to be found only among rows that have a temporal separation greater than the mean period of the time series. In other words, for row Xj, the nearest neighbor is to be searched for among rows Xi such that |j − i| > mean period. The mean period is estimated as the reciprocal of the main frequency of the series' power spectrum. This restriction breaks any temporal correlation among neighbors. At the same time, each pair of neighbors found in this way can be considered as nearby initial conditions for different trajectories. The largest LE will then be estimated as the mean rate of separation of the nearest neighbors. We call this the minimal temporal separation of neighbors, and denote it by W (i.e., |j − i| ≥ W).
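A minimal sketch of this constrained nearest-neighbor search over the embedded matrix (our own code; W plays the role of the mean-period separation):

import numpy as np

def nearest_neighbors(X, W):
    """For each row of the attractor matrix X, return the index of its Euclidean
    nearest neighbor among rows with temporal separation of at least W."""
    M = len(X)
    nn = np.empty(M, dtype=int)
    for j in range(M):
        d = np.linalg.norm(X - X[j], axis=1)
        d[max(0, j - W + 1) : j + W] = np.inf   # exclude temporally close rows (and j itself)
        nn[j] = int(np.argmin(d))
    return nn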

To do this, one can view the LE as linked to the divergence between a pair of nearest neighbors, dj(i), by Equation 4, where λ1 is the divergence rate, i is an index, and δt is the sampling period of the time series. The equation measures how very close initial conditions diverge in the i-th sample after the initial value.

dj(i) ≈ Cj e^(λ1 · iδt)   (4)

Taking the logarithm of both sides of Equation 4 yields Equation 5, which represents a family of lines (for j = 1, 2, · · · , M), each with a slope proportional to λ1.

ln dj(i) = ln(Cj) + λ1 (i · δt)   (5)

The largest LE can be accurately calculated using a least-squares fit to estimate an average line with respect to those defined by Equation 5, in the manner represented by Equation 6, where the notation < > indicates the average over all values of j. The slope of the average line described by Equation 6 is an accurate estimate of the largest LE.

y(i) = (1/δt) < ln dj(i) >   (6)
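Putting Equations 4-6 together, a minimal sketch (ours, not the authors' code) of estimating the largest LE from the embedded matrix and the nearest-neighbor indices computed above:

import numpy as np

def largest_le(X, nn, n_steps, dt):
    """Average ln-divergence of nearest-neighbor pairs over n_steps samples,
    then fit a line; its slope approximates the largest Lyapunov exponent."""
    M = len(X)
    log_div, counts = np.zeros(n_steps), np.zeros(n_steps)
    for j in range(M):
        k = nn[j]
        for i in range(n_steps):
            if j + i < M and k + i < M:
                d = np.linalg.norm(X[j + i] - X[k + i])
                if d > 0:
                    log_div[i] += np.log(d)
                    counts[i] += 1
    y = log_div / np.maximum(counts, 1) / dt          # Eq. 6: (1/dt) <ln d_j(i)>
    slope, _ = np.polyfit(np.arange(n_steps), y, 1)   # least-squares fit of the average line
    return slope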

We wanted to use the Rosenstein et al. algorithm as the basis for one that could track the LE in a data stream. In other words, as new data enters the system, the LE is recomputed and monitored for changes. The key to doing this efficiently is to incrementally monitor the changes in nearest neighbors for all the vectors in the attractor. As a new batch (chunk) of data arrives, new vectors can be defined in the attractor (new points in phase space). At the same time, we discard an equivalent number of the older vectors. This procedure divides the stream into regions, as shown in Figure 1. Region A is composed of the time-series data points to be discarded, along with the phase-space vectors they define (as first attributes of the phase-space vector). Region B represents the points in the time series that were already in the system before the arrival of the new chunk, while C is the new set of points. Both regions B and C contain the time-series points that serve as the basis for the new computation of the LE.

Fig. 1. The regions of the data stream.

Discarding points in A has the consequence of eliminating some phase-space vectors that may be nearest neighbors of other vectors. At the same time, including points in C has two consequences. First, it creates the need for finding nearest neighbors for new phase-space vectors; secondly, the newly created vectors may become nearest neighbors of old vectors in region B (replacing others in that role). Notice that by discarding as many data points as are newly received, the number of rows M of the matrix that defines the attractor is kept constant (by Equation 3). Loosely speaking, a set of "old" phase-space vectors is replaced by a set of "new" phase-space vectors of the same size.

Figure 2 shows the pseudo-code of our algorithm. A crucial data structure is NN, an (M + C) × 3 matrix that contains one entry for each phase-space vector. Each entry i contains the index of the nearest neighbor of i, the distance between the neighbors, and a binary value that indicates whether this vector has been assigned a new nearest neighbor or not. (So, NN[i, 0] is an index j to the nearest neighbor of i, NN[i, 1] is a real number d, the distance between i and j, and NN[i, 2] is either 0 or 1, with 1 indicating that the vector i has been assigned a new nearest neighbor.) As a new chunk of points of size C is admitted, that exact number of "old vectors" (from 0 to C − 1) are examined to see whether they are nearest neighbors of the vectors that stay in the system (those representing points in region B). If a vector j is found such that NN[j, 0] = i, a substitute among vectors not in region A is found, and NN[j, 2] is set to 1. Next, the remaining vectors in region B (i = C, · · · , M − 1) that have not changed neighbors are examined to see if a new vector (i = M, · · · , M + C − 1) is a closer nearest neighbor than the one the vector currently has. Finally, nearest neighbors are calculated for the new vectors. The matrix is re-ordered, pushing the entries up exactly C positions, making room for a batch of C new entries (this is best achieved by storing the matrix as a circular buffer). Of course, all the NN[i, 2] are set to 0. The first M elements of the re-ordered NN matrix are used as the basis for the new computation of the LE, as explained above. Note that only those vectors for which the nearest neighbor changed (marked by NN[i, 2] == 1) will produce changes in the computation of Equation 5. The pseudo-code for the computation of the LE is shown in Figure 3. That code simply computes the divergence for each pair of nearest neighbors and averages it over the interval B + C. A new value of the LE is produced for each invocation of this code, i.e., every time a new batch of C points is processed.



Given C new vectors
For i = 0, · · · , C − 1                       /* vectors in region A */
    Search among j = C, · · · , M − 1
        If NN[j, 0] == i                       /* i is the nearest neighbor of j */
            Look for nearest neighbor of j among vectors k = C, · · · , M + C − 1
            Mark NN[j, 2] = 1
For i = C, · · · , M − 1                       /* vectors in region B */
    If NN[i, 2] == 0
        For j = M, · · · , M + C − 1
            If d = dist(i, j) < NN[i, 1]       /* j is a closer neighbor */
                NN[i, 0] = j
                NN[i, 1] = d
                NN[i, 2] = 1
For i = M, · · · , M + C − 1                   /* vectors in region C */
    Search for nearest neighbor k of i among j = C, · · · , M + C − 1
    NN[i, 0] = k
    NN[i, 1] = dist(i, k)
    NN[i, 2] = 1
Reorder the NN matrix by pushing elements C positions up
Use the first M entries of the matrix NN to recompute the LE with the Rosenstein et al. algorithm
Set NN[i, 2] = 0 for all i

Fig. 2. Pseudo-code of the algorithm to process a new batch of points
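A compact (and deliberately naive) Python rendering of this bookkeeping, as our own sketch rather than the authors' code: the circular-buffer optimization is replaced by explicit copying, and W stands in for the mean-period separation.

import numpy as np

def process_batch(X_old, new_vectors, NN, W):
    """Drop the C oldest phase-space vectors, append C new ones, and repair the
    nearest-neighbor table NN, an (M + C) x 3 array [index, distance, changed]."""
    C, M = len(new_vectors), len(X_old)
    X = np.vstack([X_old, new_vectors])          # regions A+B followed by region C

    def find_nn(i, candidates):
        d = np.array([np.linalg.norm(X[i] - X[k]) if abs(i - k) >= W else np.inf
                      for k in candidates])
        best = int(np.argmin(d))
        return candidates[best], d[best]

    surviving = list(range(C, M + C))            # vectors in regions B and C
    for j in range(C, M):                        # 1) neighbor about to be discarded?
        if NN[j, 0] < C:
            NN[j, 0], NN[j, 1] = find_nn(j, surviving)
            NN[j, 2] = 1
    for j in range(C, M):                        # 2) is a new vector a closer neighbor?
        if NN[j, 2] == 0:
            for i in range(M, M + C):
                d = np.linalg.norm(X[j] - X[i])
                if abs(j - i) >= W and d < NN[j, 1]:
                    NN[j, 0], NN[j, 1], NN[j, 2] = i, d, 1
    for i in range(M, M + C):                    # 3) neighbors for the new vectors
        NN[i, 0], NN[i, 1] = find_nn(i, surviving)
        NN[i, 2] = 1
    NN[:M] = NN[C:].copy()                       # 4) shift entries up C positions
    NN[:M, 0] -= C                               #    re-base indices after dropping region A
    NN[M:] = 0
    NN[:, 2] = 0
    return X[C:], NN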


IV. EXPERIMENTAL RESULTS

We conducted our experiments on an Intel Pentium 4 running at 3 GHz, with 2 GB of RAM and 1 MB of cache memory. The speed of the algorithm varied with the parameter settings and, obviously, with the data set being processed, but we can report an average speed of 2,600 samples per second for values of C ranging from 50 to 1,000 while processing a Lorenz signal. (The Lorenz signal is generated by Equation 7.)

dx/dt = σ ∗ (y − x)   (7)
dy/dt = r ∗ x − y − x ∗ z
dz/dt = x ∗ y − b ∗ z
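For reference, a minimal sketch (ours) of generating a Lorenz x(t) series by simple Euler integration; the parameter values σ = 10, r = 28, b = 8/3 are the commonly used ones and are assumed here, since the paper does not list them.

import numpy as np

def lorenz_series(n, dt=0.01, sigma=10.0, r=28.0, b=8.0 / 3.0):
    """Generate n samples of the x-coordinate of the Lorenz system (Eq. 7)."""
    x, y, z = 1.0, 1.0, 1.0
    out = np.empty(n)
    for i in range(n):
        dx, dy, dz = sigma * (y - x), r * x - y - x * z, x * y - b * z
        x, y, z = x + dt * dx, y + dt * dy, z + dt * dz
        out[i] = x
    return out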

The part of the code that re-computes τ and dim is triggered by a drastic change in the value of the LE (the definition of "drastic" is controlled by a parameter, the percentage of change in the LE, which during the experiments was set at 50%). This computation amounted to 1% of the running time of the algorithm, on average.

Modified Rosenstein's Algorithm (NN)
For each pair of nearest neighbors i, j (NN[i, 0] = j)
    For t = 0, t < B + C
        d(t) += ln(dist(i[t], j[t]))
For t = 0, t < B + C
    average_d(t) = < d(t) >
    y(t) = K ∗ average_d(t)        /* K is a sampling period constant */
Apply a least-squares fit to y(t) and compute the slope of the line, which corresponds to λ1 in Eq. 4

Fig. 3. Pseudo-code of the algorithm to compute the LE, once the nearest-neighbors matrix is set up by the code shown in Figure 2. Each invocation of this procedure outputs a value of the LE for the points in B + C.

We conducted experiments to find the minimum number of samples that, when used with our algorithm, were sufficient to produce an accurate value of the LE for a known time series. We used the data set of the Lorenz time series published by Dr. Eric R. Weeks that can be found in [24]. For this data, the minimum number of samples that gave a satisfactory answer was 800 points.

A. Sensitivity

We wanted to test the sensitivity of our algorithm to the few tunable parameters and obtain a good set of values with which to start the experimentation. The parameters we can change are:

1) Size of the A, B, and C regions: This defines the number of rows (M + C) that the matrix NN will have. Obviously, regions A and C are of equal size (the same number of points that enter the system leave it). Moreover, the size of region C is linked to the rate at which the data stream arrives at the system. But we wanted to experiment with combinations of the two sizes (B, and that of A and C) to understand how this affects the computation of the LE.

2) Minimal temporal separation of neighbors, W: As explained before, this parameter ensures that the chosen nearest neighbors are close spatially (in the phase space), but not temporally (in the original time-series space).

3) Time delay, τ: The delay in the phase space transformation.

4) Embedding dimension, dim: The embedding dimension of the phase space.

We conducted a series of Analysis of Variance (ANOVA) tests, one for each variable in the list above, with the LE value as the dependent variable. The data set used was the Lorenz time series [24] with 16384 sample points. The results of those tests are reported in Table I.



Parameter   Range       F-statistic   p-value
W           100-300     2.09          0.146
B           1000-1500   0.4           0.67
C           200-400     1.67          0.21
τ           18-28       0.282         0.76
dim         2-4         0.668         <0.005

TABLE I
Results of ANOVA tests of each parameter (as the independent variable) and the LE value (as the dependent variable). The second column shows the range of values for the parameter, the third the F-statistic, and the last column the p-value.

(The tests were conducted, of course, with the part of the code that re-computes τ and dim turned off.) The first column of the table indicates the parameter, the second the range used for the test, the third the F-statistic, and the last the p-value. As can be seen in the table, none of the p-values except the one for dim is small enough to reject the null hypothesis at 95% confidence (that the parameter and the LE value are statistically independent), which shows that the computation of the exponent is largely independent of the choice of those parameters (W, B, C, τ) for the ranges shown. The last p-value is small enough to reject the null hypothesis and conclude that the LE is sensitive to the choice of the embedding dimension dim. So, with the exception of the embedding dimension, the computation of the LE remains stable over large ranges of the other parameters. This is indeed very important for our application, as we do not have to constantly adapt our operating parameters to the data stream. We only have to be able to track the embedding dimension appropriately as the new data comes in, which we do with the help of the software obtained from [4].

B. Tracking drastic changes

We conducted experiments mixing two known signals: a linear (periodic) time series with points coming from a sinusoid equation, and a non-linear time series (Lorenz). The idea is to concatenate these time series and use our technique to see how fast the algorithm can detect the drastic change from chaotic to non-chaotic behavior (and vice versa).

The first test was conducted by building a time series sinusoid-Lorenz-sinusoid and running the software with different settings for W, B, and C. We wanted to measure how fast the software was able to report the right value of the LE after the signal had changed. (There are two transitions: the first from sinusoid (non-chaotic) to Lorenz (chaotic), the second from Lorenz to sinusoid.) Table II reports the results. For each transition, we report the time as a multiple of C, indicating how many periods of new batches were needed to converge to the right value of the LE. Two things can be observed in the data. First, as C increases, the number of batch periods needed to converge, of course, decreases. More importantly, the transition from non-chaotic to chaotic is tracked faster than the converse one (chaotic to non-chaotic). We repeated the experiment with a signal of the form Lorenz-sinusoid-Lorenz, and the same phenomenon was observed in that case.

W     B      C     S-L   L-S
100   1000   200   1     4
100   1000   300   1     3
100   1300   200   2     6
100   1300   300   2     5
100   1500   200   3     7
100   1500   300   2     5
200   1000   200   1     4
200   1000   300   1     4
200   1300   200   2     6
200   1300   300   2     4
200   1500   200   3     7
200   1500   300   2     4
300   1000   200   1     5
300   1000   300   1     5
300   1300   200   2     7
300   1300   300   1     5
300   1500   200   2     8
300   1500   300   2     5

TABLE II
Results of tracking a sinusoid-Lorenz-sinusoid signal. The number of batches (C) needed to converge to the right value in the two transitions, sinusoid to Lorenz (S-L) and Lorenz to sinusoid (L-S), are reported.

This can be easily explained by the way the computation of the LE is performed. In principle, in theory, the computation of delayed-coordinates-based embedding requires infinitely long, noiseless data. In practice, of course, we have to make do with finite data. For the sinusoid case, the result has to converge to 0 (zero divergence between two close neighbors as they "move" in the time series). This can be approximated when the sampled points of the sinusoid time series cover (sufficiently) many sinusoidal periods. Now, as the divergence is computed by adding up contributions from all pairs of nearest neighbors (and then averaging) within the current region B, to obtain an average LE close to zero the system, in order to acquire enough data to fulfill the condition stated above, needs approximately B/C new data chunks, which is consistent with the results obtained. For the other transition (sinusoid to Lorenz), as soon as the Lorenz signal starts, the periodic condition (zero divergence) is violated and the averaged contributions of divergence become dominated by the chaotic Lorenz data. (It is worth remarking that the difference in amplitudes between the Lorenz and sinusoid signals has no influence on the results: identical results are obtained by making the amplitudes of the two signals equal.)

C. Henon

The Henon attractor is a chaotic signal that can be generated by the equations shown in (8). The data can be obtained from [24]. The results for this signal are shown in Figure 6.

x′ = a + b ∗ y − x²   (8)
y′ = x
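A two-line iteration of Equation 8 for completeness (our sketch; the parameter values a = 1.4, b = 0.3 are the classical choice and are our assumption, not stated in the paper):

def henon_series(n, a=1.4, b=0.3, x=0.1, y=0.1):
    """Iterate the map of Eq. 8 and return the sequence of x values."""
    xs = []
    for _ in range(n):
        x, y = a + b * y - x * x, x
        xs.append(x)
    return xs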



Fig. 4. Concatenated Sinusoid-Lorenz-Sinusoid signals and plot of the tracked LE.

Fig. 5. Concatenated Lorenz-Sinusoid-Lorenz signals and plot of the tracked LE.

D. Real Data Sets

The following are experiments performed on real data sets. The provenance of each set is indicated below. The parameters are set to W = 100, B = 1000, C = 200, unless otherwise indicated.

1) EEG signal: We used an EEG signal recorded at 102.4 Hz and covering 117 seconds before, during and after a short epileptic seizure, available at [15] (12000 sample points). The results and the raw signal are shown in Figure 7. The resulting tracked LE is shown below the original time series on the same time scale. It is clearly shown that the LE is much lower for the time just around and during the seizure.

Fig. 6. The Henon signal and plot of the tracked LE.

Fig. 7. The EEG signal and the plot of the LE tracked over time.

Parameters W, B and C for this case are 100, 1500 and 400, respectively.

2) Roanoke, VA, temperature time series: This data is available at [12]. Figure 8 shows the time series of daily temperatures recorded in Roanoke, VA during the period 1984-2003, along with the corresponding LE graph. We observe three points in the curve where the LE value went below zero. We were able to find out that at two of those times (Sept. 1996 and Aug. 2001) floods occurred in the area. (The flood information can be found in [21].) We conjecture that the drastic change in the LE in those cases was due to the flood phenomena (although it is outside the scope of this paper to elaborate on the climatological soundness of this conjecture).



Fig. 8. Time series of temperatures in the Appalachian region over a period of 20 years and the corresponding LE values.

It is important to remark that these changes in behavior are impossible to detect in the raw data, which looks very much "random" (although it is not random at all).

3) Southern Oscillation Index (SOI): The SOI is defined as the normalized pressure difference at the Pacific Ocean surface measured between Papeete, Tahiti and Darwin, Australia. The SOI is an easily quantifiable climatic parameter used to measure the strength of the atmospheric signal in local and regional data. The data can be obtained at [14]. The tracking of the LE for this data is shown in Figure 9, which shows very moderate, but positive, LE values throughout the period 1991-2004 (4904 sample points). This result agrees with previous work done on the SOI data (see [7], [23]).

4) Currency exchange data: Figure 10 shows the plot of the raw data for the daily British pound exchange rate during the years 1973-1989 (3645 sample points). This data is available at [19]. The data shows no signs of chaos during the entire period (LE = 0). These results are in accordance with earlier work [6] that claims no evidence of chaos in foreign exchange rates.

5) NYSE data: The data used in this experiment consists of 6430 sample points of daily returns on the NYSE, and is available at [13]. Figure 11 shows the LE tracked for the NYSE data during the period 1967-1988. On two occasions the graph shows sharp drops in LE values. These drops seem to coincide with the 1973 fuel crisis and the 1987 stock market crash.

V. RELATED WORK

Several papers address the issue of computing the Lyapunov Exponent from time series data (although none of them aims to track it from a data stream). Wolf et al. [25] present an algorithm that computes the non-negative Lyapunov spectrum from experimental time series by looking at the long-term growth rate of elements in attractors of analytically defined systems. This becomes a serious limitation of the method in practice, since the analytical definition is rarely available. Eckmann et al. [10] described their algorithm as performing three main steps: reconstructing trajectories in the system phase space, creating maps tangent to the trajectories in the reconstructed space, and deducing the LE from the maps. Brown et al. [2] add the use of hypothetical higher-degree polynomial models. Their method can be seriously degraded by even small amounts of noise. Sato et al. [17] presented a method for estimating the largest LE of high-dimensional chaotic systems by investigating the distance between two nearest neighbors on the system attractor. They assumed a logarithmic formula for the LE and fixed a problem of slow convergence due to instability. A more computationally expensive algorithm is proposed in [22]. Rosenstein et al. [16] developed an algorithm based on the work in [17], which can effectively compute the largest LE with small amounts of data, a property that makes it ideally suited for our data streaming purposes.

Fig. 9. The SOI data and its tracked LE during the period 1991-2004.

Fig. 10. British pound exchange rate for the years 1973-1989 (lower figure), and the corresponding LE series (upper figure). No sign of chaos is found.

Fig. 11. LE tracking of the NYSE daily returns for the period of 1967-1988.

In the data mining literature, very few articles have addressed the issue of analyzing chaotic time series. Chakrabarti and Faloutsos [5] use the phase space transformation to develop a method of non-linear forecasting for time series. They develop techniques to automate the computation of the embedding parameters using fractal analysis.

Perhaps the most widely known application of the LE is the prediction of epileptic seizures: the sudden decrease of chaotic behavior in the EEG series has been proven to be a reliable indicator of the imminent onset of a seizure. This phenomenon has been replicated and documented in a variety of published works (see, for example, [9], [8], [11], [18], [1]).

VI. CONCLUSIONS

In this paper we have presented a novel technique to track the Lyapunov exponent in a data stream. The exponent is an accepted indicator of the level of chaos in a time series, and sudden changes in it are usually indicative of events that are worth examining in more detail. Tracking the changes is a way of mining these unusual points in the time series.

Our algorithm, based on previous work by [16], is able to incrementally calculate the LE, re-using a large portion of the calculations made for the previous batch of points. It does this by figuring out which of the older points change nearest neighbors as a consequence of the incorporation of the new batch of points, and by completing the computation of nearest neighbors for the new points.

We have shown, by way of experimentation, that our technique is very robust with respect to wide ranges of parameter settings. We have also demonstrated how the technique is capable of converging to known values of the LE (for known time series such as the Lorenz signal) after a drastic change caused by switching between radically different signals (e.g., a sinusoid). We have tested our method on a variety of real data, with very satisfactory results.

REFERENCES

[1] Babloyantz, A., and Destexhe, A. (1986) Low-dimensional chaos in an instance of epilepsy. Proc. Natl. Acad. Sci. USA Neurobiology, 83, 3513-3517.

[2] Brown, R., Bryant, P., and Abarbanel, H. D. I. (1991). Computing the Lyapunov spectrum of a dynamical system from observed time series. Physical Review A, 43, 27-87.

[3] Buzug, T., Reimers, T., and Pfister, G. (1990). Optimal reconstruction of strange attractors from purely geometrical arguments. Europhys. Lett., 13, 605-610.

[4] NCSL. Software on Nonlinear and Complex Systems. http://www-ncsl.postech.ac.kr/en/softwares/

[5] Chakrabarti, D., and Faloutsos, C. (2002) F4: Large-Scale Automated Forecasting using Fractals. Proceedings of CIKM, Washington DC.

[6] Demos, C. Vassilicos and F. Tata. (1993) No Evidence of Chaos But Some Evidence of Multifractals in the Foreign Exchange and the Stock Market. In Application of Fractals and Chaos. The Shape of Things, Editors: A. J. Crilly, R. A. Earnshaw and H. Jones, Springer-Verlag.

[7] Elsner, J. B. and Tsonis, A. A. (1992) Nonlinear prediction, chaos, and noise. Bull. Am. Meteorol. Soc., 73, 49-60, 1992.

[8] Iasemidis, L. D., Sackellares, J. C. (1991) The evolution with time of the spatial distribution of the largest Lyapunov exponent of the human epileptic cortex. In: Dennis Duke (Ed.), Measuring Chaos in the Human Brain, 49-82. World Scientific Publishing Company, New Jersey.

[9] Iasemidis, L. D., and Sackellares, J. C. (1996) Chaos theory and epilepsy. The Neuroscientist, 2, 118-126.

[10] Eckmann, J. P., Kamphorst, S. O., Ruelle, D., and Ciliberto, S. (1986) Lyapunov exponents from time series. Physical Review A, 34(6), 4971-4979.

[11] Moser, H. R., Meier, P. F., Wieser, H. G., Weber, B. (2000) Pre-ictal changes and EEG analyses within the framework of Lyapunov theory. In: Lehnertz K., Arnhold J., Grassberger P., Elger C. E. (eds.) Chaos in Brain? Singapore: World Scientific, 96-111.

[12] National Climatic Data Center. http://lwf.ncdc.noaa.gov/oa/ncdc.html

[13] NYSE. http://www.nyse.com/marketinfo/datalib/1022221393023.html

[14] Queensland Government. SOI data. http://www.longpaddock.qld.gov.au/SeasonalClimateOutlook/SouthernOscillationIndex/SOI-DataFiles/index.html

[15] Quiroga, R. Q. EEG data. http://www.vis.caltech.edu/~rodri/data.htm

[16] Rosenstein, M. T., Collins, J. J., and De Luca, C. J. (1993) A practical method for calculating largest Lyapunov exponents from small data sets. Physica D, 65, 117-134.

[17] Sato, S., Sano, M., and Sawada, Y. (1987) Practical methods of measuring the generalized dimension and largest Lyapunov exponents in high-dimensional chaotic systems. Prog. Theor. Phys., 77, 1-5.

[18] Slutzky, M. W., Cvitanovic, P., and Mogul, D. J. (2001) Manipulating epileptiform bursting in the rat hippocampus using chaos control and adaptive techniques. IEEE Transactions on Biomedical Engineering, 50, 559.

[19] SSHL. Exchange rates daily. http://ssdc.ucsd.edu/ssdc/Exchage.Rates.Daily.text

[20] Takens, F. (1981). Detecting strange attractors in turbulence. In D. A. Rand and L.-S. Young (Eds.), Dynamical Systems and Turbulence, Warwick, 1980. Lecture Notes in Mathematics, 898 (366-381). Berlin: Springer-Verlag.

[21] U.S. Geological Survey. http://water.usgs.gov/pubs/circ/2003/circ1245/

[22] Wales, D. J. (1990) Calculating the rate of loss of information from chaotic series by forecasting. Nature, 350, 485.

[23] Webster, P. http://www.usc.edu/org/seagrant/elnino/quotes.html

[24] Weeks, E. R. http://www.physics.emory.edu/~weeks/research/tseries1.html

[25] Wolf, A., Swift, J., Swinney, H., and Vastano, J. (1985). Determining Lyapunov exponents from a time series. Physica D, 16, 285-317.


Workflow Process Models: Discovering Decision Point Locations by Analyzing Data Dependencies

Sharmila Subramaniam, Vana Kalogeraki, Dimitrios Gunopulos
Computer Science and Engineering Department
University of California, Riverside, CA 92521

Fabio Casati, Umeshwar Dayal, Mehmet Sayal, Malu Castellanos
HP Labs
1501 Page Mill Road, Palo Alto, CA 94304

Abstract

Workflow technologies are being increasingly used by business enterprises to enhance their process and service efficiency. Workflow process models, which form the backbone of process automation, are manually designed. This process entails assumptions and errors, which lead to inaccurate models and inefficient process executions, along with complexity in understanding the process itself. In this work we present a novel technique for improving the accuracy, and thereby the efficiency, of workflow models. Our method attempts to precisely position the decision points in a process model through data mining techniques. We implement methods to discover efficient positions for decision points, and transform the model graph to enable removal of redundant tasks.

1 Introduction

Business process management plays a vital role in any business environment. As the first step toward improving process efficiency, individual activities of processes are automated with current technological developments. To meet demands arising out of market competition, automating the execution order (i.e., process flow control) of these activities becomes essential. This has led to the development of Workflow Systems. The current research work in the Workflow Management System domain is directed toward business process model design, representation, process flow automation, system performance monitoring, etc.

Process modeling and workflow design have been a challenging step due to the difficulty modelers face in capturing business logic. Though there are unwritten rules that are being followed, being unwritten, they are not always captured or revealed during the design phase. It is therefore possible to overlook many aspects during the modeling phase and arrive at an erroneous design [18].

Data logged during execution of a workflow (known as the workflow log) can be analyzed as input for process redesign. This is referred to as the workflow diagnosis step in the workflow life-cycle described in [16]. An important field of research in workflow diagnosis is process discovery, where a process model is rediscovered from its workflow logs [2, 5]. Workflow logs also serve as rich sources of data from which significant details regarding process execution can be inferred. For example, in [8], the authors show that the workflow log can be analyzed to infer situations that lead to exceptions during process execution.

In this paper, we propose a method to rediscover the positions of decision points in a workflow model by analyzing the dependencies between the input/output data of the events stored in the workflow log. We consider workflow graph models that represent the activities of a process and the control flows (precedence relationships) and the data flows (data dependencies) between the activities. Decision points are the XOR-nodes in a workflow graph model that determine the flow path depending on the value of the data set available at the time of decision making.

As an example, in the student admission process model shown in Figure 1(a), Incomplete Student File? is a decision point. This node represents a decision-making function which returns a yes (y) or no (n) based on relevant data. Suppose an analysis of the workflow log indicates that the decision at Incomplete Student File? does not take any input from the activities Admission Decision Process and Contact SIS (Student Information System).


Figure 1. (a) Graph Model for Student Admission Process (b) Restructured Model

We can feed this finding as an input to the workflow redesign. We could move the decision point after the activity Get Application and also remove the activities Contact SIS and Admission Decision Process from the y-path. Thus, for the instances having an incomplete student file, the execution time is reduced significantly.

From the above example, we see that the discovery of the earliest positions for decision points leads us toward identifying and removing redundant tasks, thereby decreasing the process execution time. In this paper:

We assume an initial graph model describing a process and a workflow log corresponding to the executions of the process. Further, we require that the workflow log contains the ordered set of events of each execution and the input and output data values of these events.

We provide a systematic approach for extracting dependencies between the output data at different stages of the process execution and the outcomes of a decision point. We apply classification algorithms to the workflow log, considering the outcomes of the decision point as classes. The classification rules with high accuracy reveal the data dependencies between the output data of the activities and the decision point.

We show how the classification rules can be interpreted to modify the original workflow graph. We illustrate with examples the following modifications to the workflow model:

placing an existing decision point in an earlier position in the graph model, or

adding a new decision point to the graph model to capture a business logic that was not represented in the initial model

For cases where a decision point is moved to a new position, we give a systematic repositioning method that results in a restructured process model which is equivalent to the original model. We show how redundant activities can be identified and removed in the restructured process model to improve the process efficiency. For repositioning the decision points, we consider workflow models that are not nested. We plan to extend the repositioning method to nested models in future work.

We show through experiments that our approach is more scalable than the brute force method for discovering the dependencies. We propose a metric, based on the depth of decision points, to compare the initial model graph with the restructured one.

2 Workflow Process Model

We model the process using a workflow graph technique similar to that described in [15]. This is homomorphic to the other models proposed in the literature [17, 7, 9, 12]. The graph model is made up of the following components: Nodes (N) and Flow Transitions (F). Nodes are further classified into Task Nodes (T) and Router Nodes (R). In addition, the process model has a start node (S) and an end node (E) indicating the start and end of the process respectively.

A flow transition can be either a control flow transition or a data flow transition. A control flow transition is a directed edge connecting two nodes, showing the direction of control flow between these nodes.


Figure 2. Workflow Graph Constructs (Task, Router, Control Flow Transition, Data Flow Transition, Start, End; Sequence, Fork, Join, Choice, Merge)

These transitions in the workflow graph determine the order in which the activities have to be executed to accomplish the process successfully.

Data flow transitions are the edges connecting activities having data dependencies [11, 3]. Workflow activities communicate with related application components during their execution. Each activity A has an input data container Input(A) and an output data container Output(A) defined as:

Input(A) = { d : d is a data item that A consumes, where d ∈ Workflow Data }

Output(A) = { d' : d' is a data item that A writes to, where d' ∈ Workflow Data }

where Workflow Data is the set of data items for the process under consideration. A data flow transition exists between a node pair (N1, N2) iff some or all of Output(N1) is consumed by N2.

Figure 2 pictorially represents these building blocks of the workflow model. Their descriptions are as follows: Task Nodes represent the individual activities needed to accomplish the process. Fork Nodes represent the AND split. The fork implies that the activities following the fork node could be executed simultaneously, but it does not impose that condition. The activity paths that split from a fork node have to synchronize later (using a Join node) in the graph to proceed with further tasks. The Join Node waits until all the control flow in-transitions of the node are triggered before proceeding with the next task/activity. The Choice Node (or Decision Point) represents the XOR split, having mutually exclusive/alternative paths out of it. The Merge Node merges the exclusive paths out of the Choice node into one path. It is triggered when any one of the control flow in-transitions is fired.
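A minimal sketch of how these constructs might be encoded for the analysis in this paper; the class and field names are ours and are not part of the model in [15].

from dataclasses import dataclass, field
from enum import Enum

class NodeType(Enum):
    START = "start"
    END = "end"
    TASK = "task"
    FORK = "fork"
    JOIN = "join"
    CHOICE = "choice"
    MERGE = "merge"

@dataclass
class WorkflowGraph:
    nodes: dict = field(default_factory=dict)         # name -> NodeType
    control_flow: list = field(default_factory=list)  # (src, dst) control transitions
    data_flow: list = field(default_factory=list)     # (src, dst) data transitions

    def add_node(self, name, kind):
        self.nodes[name] = kind

    def add_control(self, src, dst):
        self.control_flow.append((src, dst))

    def add_data(self, src, dst):
        self.data_flow.append((src, dst))

# Fragment of Figure 1(a):
# g = WorkflowGraph()
# g.add_node("Get Application", NodeType.TASK)
# g.add_node("Incomplete Student File?", NodeType.CHOICE)
# g.add_control("Get Application", "Incomplete Student File?")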

Figure 3. Output of activity Ax determines the selection of path (i.e., l, m or n) at the choice node C

3 Problem Definition

The efficiency of a process is characterized by the time taken for the execution of the process invocations. Wrong placement of DPs (Decision Points) leads to undesirable executions of the process, whereas inefficient placement of DPs leads to unwanted executions of activities, thereby increasing the execution time. An optimal position for a DP is one where all the resources, inclusive of the data values, are available. The sources of the data values consumed by a decision point could either be static, i.e., data that has not been modified by earlier activities, or dynamic, i.e., data that has been modified and passed on to the DP by other activities. Dynamic data accounts for the decision point's dependency on the activities. In complex workflow models, some of the data flow dependencies are overlooked or wrongly predicted, resulting in suboptimal decision point placements. We propose to identify such data dependencies from the workflow log and restructure the graph model.

In a graph model G, let A denote the set of activities that precede a decision point D. When the decision made at D is dependent on the value of the output of an activity Ax (Ax ∈ A), we say that the activity Ax is correlated to the decision at D. The problem therefore is to retrieve, for all decision points D in G, the pairs (Ax, D) where the correlation between Output(Ax) and the decision at D is greater than a threshold value. This is followed by a possible improvement to the graph model as described in the following sections.

We note that there are various attributes of activities that can be correlated to the choices made at a decision point. For example, we could check whether the activation or completion time of the preceding activities relates to the choices made. In our approach, we consider the output data of the activities to discover the correlation. Output data can be thought of as the more relevant option, because it is mostly due to ignorance of the exact point of availability of data values that decision points are not placed at their earliest positions.

Algorithm 1 describes the brute force algorithm to discover the correlated (Activity, Decision point) pairs. This exhaustive search algorithm is prohibitive, as its running time is exponential in the size of the workflow graph.


Algorithm 1 Given a process model graph G, the workflow log and the value of threshold, return those (activity, decision point) pairs having correlation greater than threshold
1: Retrieve the required tuples from warehoused logs
2: for every path of the set of possible paths from the graph model do
3: for every unique (activity, decision point) pair do
4: Calculate correlation by traversing the log
5: end for
6: end for
7: Return those (activity, decision point) pairs with correlation greater than the threshold value
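A small sketch of the exhaustive search, assuming the warehoused log has been flattened into one record per process instance that maps activity names to output values and decision point names to outcomes; the per-path enumeration of Algorithm 1 is collapsed into a direct loop over pairs, and the correlation score shown (agreement with the majority outcome per output value) is only a stand-in for whichever measure the modeler chooses.

from collections import Counter, defaultdict

def correlation(records, activity, dp):
    """Score how well Output(activity) predicts the outcome at decision point dp."""
    by_value = defaultdict(Counter)
    for r in records:                        # r maps names to values, e.g. {"Ax": 3, "D": "y"}
        if activity in r and dp in r:
            by_value[r[activity]][r[dp]] += 1
    total = sum(sum(c.values()) for c in by_value.values())
    if total == 0:
        return 0.0
    agree = sum(c.most_common(1)[0][1] for c in by_value.values())
    return agree / total                     # fraction explained by the majority outcome per value

def brute_force(records, activities, decision_points, threshold):
    return [(a, dp) for dp in decision_points for a in activities
            if correlation(records, a, dp) > threshold]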

Therefore, we propose to find the correlations by mapping the problem to a classification problem. Workflow logs are considered as the training data to be classified, and the outcomes of the decision points are considered as the classes. The mining engine derives a set of classification rules. The classification rules are the mappings from the output of different activities of the process to the outcomes of the decision point. As an example, the following are classification rules of this form, indicating the dependency between the task Ax and the decision point C in the subgraph shown in Figure 3: if Output(Ax) takes one value, then choose path l; if Output(Ax) takes another value and the output of an earlier activity satisfies a further condition, then choose path m. The accuracy of the classification, i.e., the probability with which the classification performed by the rule is correct, is also specified by the classifier. If any of the above classification rules has significant accuracy (which is quantified by the modeler), it indicates the presence of a high correlation between Ax and C.
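A sketch of this classification step on the same flattened records, using scikit-learn's DecisionTreeClassifier purely as an accessible stand-in for C4.5; the encoding of nominal outputs and the use of training accuracy as a rough rule-quality score are our own simplifications.

import numpy as np
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier

def mine_rules_for_dp(records, activities, dp):
    """Fit a decision tree mapping activity outputs to the outcome at decision point dp."""
    rows = [r for r in records if dp in r]
    X_raw = [[str(r.get(a, "missing")) for a in activities] for r in rows]
    y = [r[dp] for r in rows]

    enc = OrdinalEncoder()                               # nominal outputs -> integer codes
    X = enc.fit_transform(np.array(X_raw, dtype=object))

    clf = DecisionTreeClassifier(min_samples_leaf=5)
    clf.fit(X, y)
    accuracy = clf.score(X, y)                           # rough rule-quality score
    # clf.feature_importances_[i] > 0 indicates that Output(activities[i]) influences dp
    return clf, accuracy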

3.1 Interpretation of Rules

The scheme for restructuring the workflow model depends on the implications of the classification rules generated. Let us call the set of rules that are of significant accuracy, and hence denote a correlation, constructive rules. A constructive rule can involve all or some of the outcomes of the DP under consideration. There can be many applications of a constructive rule based on its nature.

1. If for a given set of instances of the process, a particular decision could be taken earlier, then a new decision point can be added to the model graph to capture this. This can be given as a suggestion to the modeler.

2. If for all possible instances of the process, a decision can be taken at an earlier position, the corresponding decision point can be repositioned to this earlier position. This possible modification is also given as a suggestion.

3. When certain conditions at routing points are incorrectly specified, they could be identified by studying the exception logs. If the source of the exception is identified as an erroneous specification of a decision condition, then this information could be presented to the modeler. Further analysis of this application category is not within the scope of this paper.

For example, let us consider the process model shown in Figure 4(a), where the activities Ax, A1, A2, A3 and A4 work toward completion of the process and D is a choice node with N as the corresponding merge node.

Consider a situation in the process where, for all the instances in which the outputs of two earlier activities take particular values, the activity A4 is processed irrespective of the output at other activities. The corresponding rule generated by a decision tree classification algorithm would be: if those two outputs take these values, choose activity A4 at D (Rule I). This falls under application category 1 discussed above. This rule can be used to remodel the graph as shown in Figure 4(b). For instances satisfying the above rule, the activity A4 can be executed right after the new decision point. This is given as a suggestion to the modeler. Such classification rules in fact capture decision-making rules which were overlooked by the modeler initially.

Let us consider another set of constructive rules generated by the decision tree algorithm for D: if Output(Ax) satisfies a condition, choose activity A4 at D; otherwise, choose activity A3 at D (Rule II). These rules rightly point out that the decision point is dependent only on the output of the Ax step, and the other data dependencies might be incorrect. This falls under Category 2, being valid for all possible instances of the process. The information, when mined, results in a possible shifting of the choice node to the position after the Ax node; the incorrect data flow transition into D is removed (see Figure 4(c)). We take this category for more detailed analysis in this paper. In the next sections, we discuss the algorithm and the model redesign for this category, followed by an experimental evaluation. A detailed analysis of the other categories is left to future work.

3.2 Identifying Redundant Activities

In the redesigned workflow model where the decision points are moved to their earliest positions, some of the activities can be identified as redundant activities, i.e., the outputs of these activities are not used in any of the succeeding steps.


Figure 4. (a) Example of a Process Model. (b) Restructured process model after adding a new decision point to capture Rule I. The duplicate of node A4 is represented as A4'. (c) Restructured process model after repositioning the decision point to capture Rule II. Duplicates of nodes A1 and A2 are represented as A1' and A2' respectively.

The decision of which of the task nodes are redundant, and therefore could be removed, is made after a detailed study of the existing data flow transitions between the activities involved.

In Figure 5, D refers to a decision point and Ax is the activity whose output is found to be correlated to D's outcomes for all possible instances of the process (Category 2). A1 and A2 are the activities that are executed between Ax and D. Decision point D is repositioned next to Ax as shown in the figure. Since one of these intermediate activities has no data flow transition to the task on the other branch in the original graph, its duplicate is considered redundant and can be removed from that path in the restructured graph (this should be done after careful consideration of other factors).

As an example of removal of a redundant task in Category 1, consider the graph in Figure 4(b). Activities whose output data will not be used in the activity A4 are removed from the path leading to A4' from the new decision node, because their outputs are not consumed on that path.

4 The Algorithm

As seen in Section 3, we identify the correlations by mapping our problem to a classification problem. Figure 6 shows the architecture of the proposed technique. The workflow log preprocessing application reads the workflow log from the warehouse and selects those attributes required for the classification algorithms. The attributes are tabulated in the format required by the classifier to form the training data. In our work, we use the decision tree classification tool C4.5 ([10]) for mining the correlated pairs. The classification rules formed by the classifier are analyzed and interpreted to decide about the repositioning of router nodes. A systematic repositioning is carried out following this. The steps involved in finding the correlated pairs and repositioning the decision points are given in Algorithm 2.

In the next section, we describe the repositioning steps of Algorithm 2.


Figure 5. The decision point D in the original subgraph, shown at the top, is correlated to the output of task node Ax. The data flow transitions between the task nodes are shown in the figure. Since there is no data flow transition between two of the task nodes, the corresponding activity can be removed in the restructured graph as illustrated.

Figure 6. Mining Architecture (workflow logs are preprocessed into a training dataset, fed to the classifier, and the resulting classification rules are interpreted into model modifications)

Algorithm 2 Given a process model graph G and workflow log L. T is the set of task nodes and N is the set of nodes in G. Return the improved graph G', where all the decisions are taken at their earliest possible positions
1: Retrieve the required tuples from warehoused logs
2: for each of the Choice nodes D present in the model graph do
3: Define the classes as the outcomes of the choice node D: class1, class2, etc.
4: Extract the tuples containing D from L
5: Prepare the training data from the workflow log, in the format required by the classification algorithm: (Output(A1), Output(A2), ..., Output(Am)) : class
6: Apply the classification algorithm C4.5 on the training data and store the classification rules and their accuracies
7: Analyze the rules to find highly correlated (Task node, Decision point) pairs (Ax, D)
8: end for
9: Initialize G' to G
10: for each of the (Ax, D) obtained, carry out repositioning of decision point D to be placed after Ax systematically do
11: Find the set of nodes from Ax to D as [Ax, n1, n2, ..., nk, D]
12: In graph model G', move D over nk, then over nk-1, and so on, up to n1
13: end for
14: Return G'

5 Repositioning Decision Points

Let us consider a given workflow graph G and a correlated pair (Ax, D). G is transformed to a new graph G' such that the decision point D is placed after the activity node Ax and G' is equivalent to G. We achieve the transformation by moving D step by step back to the task node Ax, maintaining the equivalency of the process model at each step. The nodes between the pair (Ax, D) can be any of the following: a task node, a fork node, a join node, a choice node or a merge node. In Figure 7, we show the process of moving the decision point D over each type of node. The labels m, n, p, q, r and s represent task nodes or subgraphs. They could represent a subgraph only if it does not violate the requirements for a structured workflow. (If there is a choice-merge or a fork-join structure between the pair (Ax, D), the structure can be considered as a task node during the repositioning steps.) We have not shown any data flow transitions in the illustrated models and assume that they are duly updated in the new model.
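As one concrete instance, the sketch below performs the transformation of Figure 7(a): a decision point is moved back over a preceding task node, and that task node is duplicated at the head of each branch. The adjacency-list representation and helper name are ours, and only the simple sequential case is handled.

def move_decision_over_task(succ, task, dp):
    """Move decision point dp in front of task node `task` (Figure 7(a)-style move).

    succ maps each node to its list of successors; `task` must lead only to dp.
    After the move, dp takes task's place and a duplicate of task heads each branch.
    """
    assert succ[task] == [dp], "sketch handles the simple sequential case only"
    branches = succ[dp]
    for node in list(succ):                       # predecessors of task now lead to dp
        succ[node] = [dp if n == task else n for n in succ[node]]
    succ[dp] = []
    for i, b in enumerate(branches):
        dup = f"{task}'{i}"                       # duplicated task node, one per branch
        succ[dup] = [b]
        succ[dp].append(dup)
    del succ[task]
    return succ

# succ = {"s": ["A"], "A": ["D"], "D": ["m", "n"], "m": [], "n": []}
# move_decision_over_task(succ, "A", "D")
# -> {"s": ["D"], "D": ["A'0", "A'1"], "A'0": ["m"], "A'1": ["n"], "m": [], "n": []}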


Figure 7. The figures show the graph transformations when a decision point D is moved over different types of nodes. The duplicate of a node X is represented as X'. (a) Moving decision point D over task node A. (b) Moving decision point D over merge node M. (c) Moving decision point D over join node J. (d) Moving decision point D over fork node F. (e) Moving decision point D over choice node C.


6 Experimental Setup

6.1 Generating Workflow Graphs and Training Data

We generate the workflow graph models for our experiments using an incremental method described below. The graph generation procedure takes the following as inputs: the number of nodes to be present in the graph and the probabilities of adding a task node, a fork-join structure and a choice-merge structure at each iteration. Initially a simple sequence structure with a single task node encompassed between a start and an end node is generated. During further iterations of incremental additions, a random task node is chosen from the graph and is converted to either a sequence of task nodes, a fork-join structure or a choice-merge structure, with the given probability.

The incremental graph generation assures that the generated model is free of structural conflicts. As shown in [1, 13, 14], a given graph is claimed to be correct if it can be reduced to an empty graph by applying a set of reduction rules. This claim can be used to prove that a graph model built by incrementally expanding a task node is correct, i.e., free from deadlock and lack of synchronization, because it implies reducibility to a single task node, and eventually to an empty graph.

For the experiments, we generate models with variable proportions of decision points by changing the probability with which a choice-merge structure is added to the graph model.
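A compact sketch of this incremental generation procedure, using our own adjacency-list representation; only the three expansion rules described above (sequence, fork-join, choice-merge) are implemented, and each expansion appends the new structure after the chosen task node.

import random

def generate_graph(num_nodes, p_seq=0.5, p_fork=0.25, p_choice=0.25, seed=0):
    """Grow a structured workflow graph by repeatedly expanding a random task node."""
    rng = random.Random(seed)
    succ = {"S": ["T0"], "T0": ["E"], "E": []}    # start -> single task -> end
    tasks, counter = ["T0"], 1

    def fresh(prefix):
        nonlocal counter
        name = f"{prefix}{counter}"
        counter += 1
        succ[name] = []
        return name

    while len(succ) < num_nodes:
        t = rng.choice(tasks)
        out, r = succ[t], rng.random()
        a, b = fresh("T"), fresh("T")
        tasks += [a, b]
        if r < p_seq:                              # append a sequence of two tasks
            succ[t], succ[a], succ[b] = [a], [b], out
        elif r < p_seq + p_fork:                   # append a fork-join block
            f, j = fresh("F"), fresh("J")
            succ[t], succ[f], succ[a], succ[b], succ[j] = [f], [a, b], [j], [j], out
        else:                                      # append a choice-merge block
            c, m = fresh("C"), fresh("M")
            succ[t], succ[c], succ[a], succ[b], succ[m] = [c], [a, b], [m], [m], out
    return succ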

In our experiments, we generate random and partially fixed synthetic training data sets. In the partially fixed data sets, we fix a particular outcome of a decision point based on the value at a particular stage of process execution, i.e., at a decision point, if the value of a variable after a given activity was a particular value, then one outcome is chosen, else the other. For random paths, the outcome at a decision point is chosen at random. This training data set is used to verify whether the technique used indeed results in mining the hidden data dependencies between (activity, decision point) pairs.

6.2 Metrics

We choose the following metric to analyze the performance of our technique: the Average Decision Depth DD_Av, which is the ratio of the sum of the depths of the decision points to the number of decision points, measured per instance of process invocation:

DD_Av = (1 / N_I) · Σ_k ( Σ_i depth_k(D_i) / N_D )

where N_D is the total number of decision points in the model, depth_k(D_i) is the depth of decision point D_i in process instance k, and N_I is the total number of process instances.

We observe the value of this metric for various complexity levels of the process model. The complexity of a model is determined by a combination of its size and the number of instance types of the graph (refer to the Appendix for the definition). The number of instance types is determined by the choice nodes present in the graph model. We vary the number of choice nodes by changing the probabilities with which choice-merge structures are added to the model at a given point of graph generation.

7 Experimental Evaluations

Our approach to decision point discovery and model redesign can be categorized into four phases. Initially, the original process graph is generated as described in Section 6.1, with the required input parameter values. Next, process logs of the necessary size are generated. (These two preprocessing steps can be substituted with a data-retrieval and a data-processing step when an actual process model is under study. The data retrieval task includes writing scripts (PL/SQL) to extract logs with the desired task attributes, and the data processing step includes creating a data analysis table consisting of only those values required for the classification algorithms.) The next two phases consist of applying the classification algorithm to the process execution logs and using the correlation information to redesign the process model. In the experiments we conducted, the decision tree classification tool C4.5 was used for mining the correlation between the output data of activities and the outcomes of decision points. The redesign phase involves systematic repositioning of the decision points in the process model, leading to an equivalent enhanced model.

We compared the performance of our technique and the exhaustive search algorithm discussed in Section 3 in terms of finding the correlated (Task node, Decision point) pairs. We observed that the set of correlated pairs we discovered with the classification technique was the same as that obtained with the exhaustive search algorithm. This shows that the mining technique is as effective as the brute force technique in finding the pairs.

In the following sections we analyze the performance of the original model and the rebuilt model in terms of their decision depths DD_Av. We also analyze how our algorithm and the brute force algorithm scale with the size of the process model graph.

7.1 Decision Depth

Figure 8 shows the values of DD_Av, the percentage decrease in decision depth and the percentage increase in the number of decision points before and after remodeling. The values were observed for various graph complexity levels.


Figure 8. (a) DD_Av of the original and the restructured model graph. (b) Percentage decrease in the decision depth and percentage increase in the number of decision points when the original graph is transformed. In (a) and (b), values are plotted for various complexity levels of the graph. (c) Running time of the classifier approach and the brute force technique. Values are plotted for various graph sizes and the workflow log size is fixed at 500.

Figure 9. (a) Initial process model. The decision point numbered 9 is dependent on the task node numbered 2. The figure also shows the data flow transitions between nodes numbered 3 and 10 and between 5 and 10. (b) Restructured model. Decision point 9 is moved up and positioned after task node 2. (c) Since the task nodes 3 and 5 did not have data flow transitions to task node 11 (i.e., their outputs were not consumed by task 11), they are removed from the workflow model graph.


The levels of complexity, varying from 1 to 4, have the following (graph size, choice-merge probability) values:

Level 1: (100, 0.20)
Level 2: (150, 0.25)
Level 3: (200, 0.30)
Level 4: (250, 0.35)

The complexity of process models increases with an increase in the number of nodes and with an increase in the percentage composition of choice nodes in the graph model. We note here that the efficiency of a given model is determined by the average execution time of process instances and not by the complexity of the corresponding graph model. The average execution time of a process decreases when the decisions are taken at their earliest positions and the redundant tasks are removed. We observe in Figure 8(b) that the total number of decision points increases after rebuilding the model using our technique. This shows an increase in the complexity of the rebuilt graph in terms of the number of choice nodes. However, we observe from Figure 8(a) that the average depth of the decision points present in a process instance decreases after applying our technique, indicating a gain in process efficiency. From Figure 8(b) we can see that this gain in process efficiency increases with the complexity of the model.

7.2 Scalability

We analyzed how our algorithm and the brute force method scale with the size of the workflow model. The average time taken by both algorithms for various graph sizes is shown in Figure 8(c). As already seen in Section 3, with an increase in graph size the search space increases exponentially for the exhaustive search algorithm. Therefore the running time of the exhaustive algorithm grows much faster with graph size than that of the mining technique.

7.3 An Example

A model graph of 15 nodes is created and logs of various sizes are generated (see Figure 9(a)). We have shown the data flow transitions (3, 10) and (5, 10) in the figure and have avoided the others for simplicity. Correlation is embedded in the log between task node 2 and the choice node 9. The C4.5 classifier identifies the correlation between 2 and 9, and the accuracy of the classification rule is 99%.

Now, the decision point 9 is repositioned as shown in Figure 9(b). The data flow transitions (3, 10) and (5, 10) are updated in the redesigned graph as (3', 10) and (5', 10). Once the decision point is shifted to its earliest position, the redundant tasks can be removed based on the existing data flow transitions. Let us assume that there were no data flow transitions between 3 and 11 and between 5 and 11 in the original model. This might imply that the outputs of 3 and 5 would never be used if task 11 was chosen at decision point 9. Having moved the decision point 9 to an earlier position (Figure 9(b)), the modeler can now remove the redundant task nodes 3 and 5, after making sure their outputs are not consumed by any other succeeding tasks. The model after removing the redundant tasks is shown in Figure 9(c).

8 Related work

Warehousing process activity attributes like timestamps, resources used, input and output values of data, etc. proves to be helpful for business-level analysis (to assess quality, identify problems, and provide solutions) of business processes [4]. The warehoused data typically is a set of workflow logs reflecting the execution details of each of the process instances. Mining of the workflow has been applied in various phases of WfMS.

A summary of the ongoing research in process mining is given in [16]. Methods of automating the process model construction through event data capture of the ongoing process were illustrated in the literature [5, 6]. Another approach to process discovery, proposed in [2], makes use of logs of past unstructured executions of the given process to construct the model. Here the dependencies between the activities are identified from the order in which they occur in the logs, and this is used for creating a dependency graph, which is subsequently reduced to the final process model. A method of exception analysis through workflow mining is discussed in [8]. In this work, the situations that lead to exceptions are captured through data mining techniques and are used for predicting and handling exceptions. Event data analysis and process mining have been applied to model rediscovery in [18]. Data flow validation in workflow models and data management in workflow environments are discussed in [11] and [3] respectively. However, little has been done in the area of logging and analyzing event output data of the process models (as opposed to event data) to enhance the processes.

9 Conclusions

We developed tools to discover the decision point locations in a process model by extracting the hidden data dependencies between tasks and decision points. We also presented a systematic approach to transform the model graph through a series of decision point repositioning steps. Ways to identify and remove potentially redundant tasks in the improved process model, through data dependency analysis, were suggested. The proposed technique was compared with the brute force method to study how the methods scale with graph size.


The efficiency of the restructured model was studied by analyzing the average depth of the choice nodes in the model graph.

References

[1] W. Aalst, K. Hee, and G. Houben. Modelling and analysing workflow using a Petri-net based approach. In G. Michelis, C. Ellis, and G. Memmi, editors, Proceedings of the Second Workshop on Computer-Supported Cooperative Work, Petri Nets and Related Formalisms, pages 31-50, 1994.
[2] R. Agrawal, D. Gunopulos, and F. Leymann. Mining process models from workflow logs. In EDBT '98: Proceedings of the 6th International Conference on Extending Database Technology, pages 469-483, London, UK, 1998. Springer-Verlag.
[3] G. Alonso, B. Reinwald, and C. Mohan. Distributed data management in workflow environments. In Proc. 7th International Workshop on Research Issues in Data Engineering (RIDE'97), Birmingham, England, April 1997.
[4] A. Bonifati, F. Casati, U. Dayal, and M.-C. Shan. Warehousing workflow data: Challenges and opportunities. In The VLDB Journal, pages 649-652, 2001.
[5] J. E. Cook and A. L. Wolf. Automating process discovery through event-data analysis. In International Conference on Software Engineering, pages 73-82, 1995.
[6] J. E. Cook and A. L. Wolf. Discovering models of software processes from event-based data. ACM Transactions on Software Engineering and Methodology, 7(3):215-249, 1998.
[7] D. Georgakopoulos, M. F. Hornick, and A. P. Sheth. An overview of workflow management: From process modeling to workflow automation infrastructure. Distributed and Parallel Databases, 3(2):119-153, 1995.
[8] D. Grigori, F. Casati, U. Dayal, and M.-C. Shan. Improving business process quality through exception understanding, prediction, and prevention. In VLDB '01: Proceedings of the 27th International Conference on Very Large Data Bases, pages 159-168, San Francisco, CA, USA, 2001. Morgan Kaufmann Publishers Inc.
[9] H. Lin, Z. Zhao, H. Li, and Z. Chen. A novel graph reduction algorithm to identify structural conflicts. In HICSS '02: Proceedings of the 35th Annual Hawaii International Conference on System Sciences (HICSS'02), Volume 9, page 289, Washington, DC, USA, 2002. IEEE Computer Society.
[10] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1993.
[11] S. Sadiq, M. Orlowska, W. Sadiq, and C. Foulger. Data flow and validation in workflow modelling. In CRPIT '04: Proceedings of the Fifteenth Conference on Australasian Database, pages 207-214, Darlinghurst, Australia, 2004. Australian Computer Society, Inc.
[12] W. Sadiq and M. E. Orlowska. Modeling and verification of workflow graphs. Computer Science Technical Report, Department of Computer Science, The University of Queensland, 1996.
[13] W. Sadiq and M. E. Orlowska. Applying graph reduction techniques for identifying structural conflicts in process models. Lecture Notes in Computer Science, 1626, 1999.
[14] W. Sadiq and M. E. Orlowska. Analyzing process models using graph reduction techniques. Inf. Syst., 25(2):117-134, 2000.
[15] W. M. P. van der Aalst, A. Hirnschall, and H. M. W. Verbeek. An alternative way to analyze workflow graphs. In A. Banks-Pidduck, J. Mylopoulos, C. Woo, and M. Ozsu, editors, Proceedings of the 14th International Conference on Advanced Information Systems Engineering (CAiSE'02), Lecture Notes in Computer Science, volume 2348, pages 535-552. Springer-Verlag, Berlin, 2002.
[16] W. M. P. van der Aalst, B. F. van Dongen, J. Herbst, L. Maruster, G. Schimm, and A. J. M. M. Weijters. Workflow mining: A survey of issues and approaches. Data and Knowledge Engineering, 47(2):237-267, Nov. 2003.
[17] W. M. P. van der Aalst and K. M. van Hee. Business process redesign: A Petri-net-based approach. Computers in Industry, 29(1-2):15-26, 1996.
[18] A. J. M. M. Weijters and W. M. P. van der Aalst. Rediscovering workflow models from event-based data. In Proceedings of the Third International NAISO Symposium on Engineering of Intelligent Systems (EIS 2002), pages 65-65. NAISO Academic Press, Sliedrecht, The Netherlands, 2002.


Computing Information Gain in Data Streams

Alec Pawling, Nitesh V. Chawla, and Amitabh Chaudhary
Department of Computer Science and Engineering
University of Notre Dame, Notre Dame, IN
apawling,nchawla,[email protected]

Abstract

Computing information gain in general data streams, in which we do not make any assumptions on the underlying distributions or domains, is a hard problem, severely constrained by the limitations on memory space. We present a simple randomized solution to this problem that is time and space efficient and tolerates a relative error that has a theoretical upper bound. It is based on a novel method of discretization of continuous domains using quantiles. Our empirical evaluation of the technique, using standard and simulated datasets, convincingly demonstrates its practicality and robustness. Our results include accuracy versus memory usage plots and comparisons with a popular discretization technique.

1 Introduction

An increasing number of real world applications now involve data streams; e.g., applications in telecommunications, e-commerce, stock market tickers, fraud and intrusion detection, sensor networks, astronomy, biology, geography, and other sciences. These data streams, whether commercial or scientific, spatial or temporal, almost always contain valuable knowledge, but are simply too fast and too voluminous for it to be discovered by known techniques. The objective of modern practitioners is to find time and memory efficient ways of modeling and learning from the streaming data, albeit at the cost of some possible loss in accuracy.

There are special cases in which it is relatively easy to learn in data streams. One commonly used scenario is when the stream is assumed to be a random sample drawn from a stationary or a slowly shifting distribution, often called the "i.i.d. assumption". In such a situation, a reasonably sized sample of the data stream can be assumed to describe the overall distribution (and hence, the entire stream) quite accurately, and the problem reduces to learning from such a sample. Another common easy case is when the underlying domain(s) of the data values is discrete, either nominal or consisting of a few numerical values. In discrete domains, the data can be represented by simply counting the number of instances of each value. Often this simple representation is sufficient and the memory usage much less than the actual size of the stream.

In this paper we consider the problem of feature selection in data streams based on computing information gain. Feature selection is an essential component of classification-based knowledge discovery, and using information gain for it is one of the most popular methods (e.g., decision tree learning with C4.5 [12]). We would like to solve the problem in a general setting without making any of the previous simplifying assumptions; in particular, we make no assumption whatsoever about the underlying distribution of the data stream. The distribution, if any, can change rapidly and arbitrarily, and even classes may suddenly appear and then disappear, as the stream zips by. Further, we do not restrict the type of data; it can be spatial, temporal, neither, or both. Given such a stream of examples, each consisting of a feature vector with values drawn from continuous or discrete domains, we want to be able to compute, at every point in the stream, the maximum possible information gain as if our set of examples was exactly the examples in the stream seen thus far. The constraints are that we see each example in the stream just once and we are allowed to store very few of the examples we see, at most an order polylogarithmic number. In addition we can take at most an order polylogarithmic time in processing each example. These are the standard efficiency constraints on stream computations (see, e.g., [1]).

Related Work There has been research in classification in data streams: either on learning a single classifier [7, 4] or a set of classifiers [13]. A classification technique such as decision trees includes the computation of information gain as a component. Hulten et al. [4, 9] build decision trees in data streams from nominal domains, and thereby also compute the information gain, under the i.i.d. assumption on a stationary or a slowly shifting distribution.


They make a clever use of the Hoeffding bounds to decide when a sufficiently large sample has been observed and use that to create their tree. If they observe that the distribution is shifting, they change to a different tree that represents the new data. Gehrke et al. developed BOAT [7], an incremental decision tree algorithm that scales to large datasets, often requiring just 2 scans of the data (in streams we usually get just one). A noteworthy strength is that it can handle different splitting criteria. Street and Kim [13] show how to build ensembles of classifiers on blocks of data. They add subsequent classifiers only when the concept is not already encompassed by the previous set of classifiers. Wang et al. [14] also implement an ensemble based approach for streaming data; they weigh each classifier based on its accuracy on the data evolving over time.

Our contributions We give a simple time and space efficient randomized algorithm for computing information gain in streams consisting of data from discrete or continuous domains. We give strict theoretical bounds on the amount of possible error in our computation. The error can be reduced by relaxing the space constraint, allowing the user to choose the appropriate balance between error and efficiency. Our algorithm does not need to know anything about the domains in advance, not even the maximum or minimum values possible. It does need to have a reasonable estimate or an upper bound on the size of the entire stream (the number n) for our bounds to be valid. Our technique is based on an original method of discretization using quantiles in tandem with a previously known algorithm for maintaining approximate quantiles in data streams.

We demonstrate the utility of our technique through experiments using standard datasets under different streaming conditions. E.g., we simulate sudden changes in the underlying distribution, as could be expected in temporal or sensor data. We also show results on a large dataset artificially generated to stress-test the algorithm. The error in the information gain we compute is well within our theoretical bounds. In addition, the feature rankings we compute are very close to those computed using a precise computation, one that has no time or space constraints (essentially, one that can store all examples in the stream). We plot space and accuracy trade-off curves for the datasets, as well as compare our results with another popular discretization approach, equal interval width binning [5].

In the following section we give a description of our technique that builds up starting from a naive solution. Following the algorithmic description we present our experiments, the setup and the results. Finally, we draw conclusions.

2 A Randomized Memory-Efficient Computation of Information Entropy

We want to compute the information gain for features in a stream of examples using a small amount of memory (relative to the number of examples seen). Specifically, at any point in the stream, for each feature Xi we would like to compute the value vi such that a partition of the examples in the stream seen thus far based on "Xi ≤ vi?" results in the maximum gain in information. We would like to do this for both discrete as well as continuous features. In addition, we want to ensure that the maximum space we use, in terms of memory, is a function at most an order polylogarithmic in the number of examples. Note, we naturally assume that the number of examples in the stream is much much larger than either the number of features or the number of class labels.

In this section we present a randomized solution to the above problem. We start by looking at a naive approach that works reasonably well for nominal features but not for continuous ones. This leads to the well-investigated idea of discretization of continuous domains; unfortunately, the known techniques of discretization are not designed for data streams and any adaptation is not guaranteed to work well. We present an original technique of discretization that guarantees that only a small amount of error is introduced in the computation of information gain. More importantly, the technique can be extended to work on a stream using a small amount of memory, while introducing only a small additional increase in error, allowing us to achieve our objective.

Naive Approach Using Counters Let there be d features denoted by X = X1, X2, . . . , Xd and c class labels denoted by the set Y. Let S = (x(1), y(1)), (x(2), y(2)), . . . , (x(t), y(t)), . . . be a stream of examples, in which x(t) is a feature vector and y(t) is a class label. (We will drop the sequence subscript (t) when not required.) Let Sn be the set of n examples seen till time t = n.

Suppose we wanted to answer the following question for all n in a memory-efficient manner: What is the information gain by partitioning the examples in Sn at Xi ≤ vi? Let Sn^L be the subset of examples such that Xi ≤ vi and Sn^R the rest of the examples in Sn. Also, let Sn,y denote the examples in Sn with class label y. Then the information entropy in Sn is

I(Sn) = Σ_{y ∈ Y} −(|Sn,y| / |Sn|) lg(|Sn,y| / |Sn|),

and the information gain is

gain(Sn, Xi, vi) = I(Sn) − [ (|Sn^L| / |Sn|) I(Sn^L) + (|Sn^R| / |Sn|) I(Sn^R) ].


We can compute the above for all n by simply maintaining two counters for each class label y: one to track |Sn,y^L| and the other to track |Sn,y^R|. This takes very little memory: 2c counters, where c is the number of class labels. If we wanted to answer the question for each feature, we can do the same for each of the d features, taking 2c·d space, still much smaller than n.
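A sketch of this counter-based bookkeeping for a single feature and a fixed threshold vi, evaluating the entropy and gain expressions above from the counts; the class is our own construction.

from math import log2
from collections import Counter

class FixedSplitGain:
    """Track gain(Sn, Xi, vi) in a stream using two counters per class label."""
    def __init__(self, vi):
        self.vi = vi
        self.left = Counter()    # counts |Sn,y^L|: label counts with xi <= vi
        self.right = Counter()   # counts |Sn,y^R|: label counts with xi >  vi

    def update(self, xi, y):
        (self.left if xi <= self.vi else self.right)[y] += 1

    @staticmethod
    def _entropy(counts):
        n = sum(counts.values())
        return -sum((c / n) * log2(c / n) for c in counts.values() if c) if n else 0.0

    def gain(self):
        total = self.left + self.right              # per-label counts for all of Sn
        n = sum(total.values())
        if n == 0:
            return 0.0
        nl, nr = sum(self.left.values()), sum(self.right.values())
        return (self._entropy(total)
                - (nl / n) * self._entropy(self.left)
                - (nr / n) * self._entropy(self.right))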

The problem with the naive approach is that we actually want to answer a more involved question: For all n and for all Xi, what is the maximum possible information gain by partitioning the examples in Sn along any point on the domain of Xi? Now, even this question can be efficiently answered if each Xi is a nominal (discrete) feature. Let the number of possible values for Xi be mi. Simply repeat the above approach for each possible value of the feature. The amount of space taken is 2c Σ_i mi, which is small enough if we assume that the mi values are much smaller than n. This assumption, of course, cannot be made for continuous features.

Using Discretization of Continuous Features Modifying continuous feature spaces to behave like nominal feature spaces is common in machine learning algorithms, and the simplest method is to use a form of discretization. In this, a continuous domain is partitioned into a finite number of bins, each with a representative value; as a result, the feature naturally behaves like a nominal feature, although a small amount of error creeps in. There are many known mechanisms for discretization: unsupervised ones like equal interval width and equal frequency intervals [2], as well as supervised ones like entropy based discretization [6]. See Dougherty et al. [5] for a review and comparisons of a variety of discretization techniques. Among these known techniques, some perform better than others, but all lack an important characteristic: they are not designed to give an upper bound on the amount of relative error introduced into the information gain computation due to the discretization. Furthermore, it is also not known how to extend these methods to compute information gain in streams. We solve both deficiencies by introducing the quantile based discretization method for computing information gain.

Quantile based Discretization The essence of quantile based discretization is simple: to compute the number of examples such that Xi ≤ vi, in a manner that bounds the fraction of error introduced by discretization, use "finer" bins in those parts where the number of satisfying examples is small, and "coarser" bins where the number of satisfying examples is large. Given a fixed number of bins to use, this introduces the minimum relative error. But it requires that the bin boundaries cannot be decided in advance; they are based on the actual values taken by the examples in the feature domain. We implement this idea using quantiles. We first discuss how to use quantiles to calculate information gain. We then explain why we cannot use precise quantiles and present an approximate quantile structure that suits our purpose.

The φth quantile of a set of n elements, for φ ∈ [0, 1], is the φn-th smallest item in the set. Hence φ = 1 denotes the maximum element, and 0 < φ ≤ 1/n denotes the smallest element. We maintain our bin boundaries at the following quantiles: α^0, α^-1, α^-2, . . . for some α = (1 + ε), where ε > 0. (It helps to think of α being a number like 2, although in practice ε is quite small.) Notice that the bins become finer as the rank denoted by the quantile decreases.

To estimate the number of examples with label y such that Xi ≤ vi we use the quantiles for the set of examples with label y ordered by values of feature Xi. We first find the largest quantile with a value less than vi. Let this be φ = α^-k. We then approximate the required number of examples by n·α^(-k+0.5). This introduces a relative error of at most 1 ± ε/2 in computing the number of examples.

To compute the entropy in a partition at Xi ≤ vi we repeat the above method for each y ∈ Y, using the estimates of the number of examples to estimate the entropy in the left side partition. It can be shown that this estimate has a relative error of at most 1 ± 7ε. For the right partition, we can compute the number of examples with label y by simply subtracting the number in the left from the total number of examples with label y. This number can, however, have a relative error of more than 1 ± 7ε if α^-k happens to be larger than 1/2. To avoid this, we use another set of quantiles, 1 − α^0, 1 − α^-1, 1 − α^-2, . . ., to compute the number of examples in that situation.

Now if we want to find the value vi that results in the minimum entropy, we perform the above computation for each value in the set of quantiles. Let v∗i be the required best-split value and let the entropy obtained by partitioning at Xi ≤ v∗i be I∗i. We can guarantee that one of the quantiles, α^−k, will have a rank that is at most an α factor away from that of v∗i and that the entropy estimated at that quantile is at most a factor

1 ± 7ε ± O(ε²)

away from I∗i. This error bound on the entropy doesn't translate into an error bound on the information gain if the gain turns out to be very small compared to the entropy. If, however, we know that the true value gain(S, Xi) is reasonably large, say, larger than (1/g) · I∗i, where 1/g > 0, then we can guarantee

|gain_disc(S, Xi) − gain(S, Xi)| / gain(S, Xi) ≤ 7εg + o(ε).

Note that feature selection in a stream is interesting only when the information gain is reasonably large. We can easily repeat this process for each feature and compute the maximum possible information gain over all features with bounded relative error. In doing so, we take O(cd · (1/ε) log n) space.

Unfortunately, in order to compute quantiles for values in a data stream precisely, either the values must arrive in sorted order or Ω(n) values must be stored [11]. The former is an unreasonable assumption in a stream and the latter sets us back to square one. Fortunately, approximate quantiles for values in a stream can be computed rather efficiently, and that makes quantile based discretization particularly useful.

Approximate Quantiles in a Data Stream For φ ≤ 1/2, a δ-approximate φ quantile of a set of n elements is an element that lies in a position between nφ(1 − δ) and nφ(1 + δ). Gupta and Zane [8] have shown how to maintain approximate quantiles for values in a data stream using O(log² n / δ³) space. They use a randomized data structure that, on being queried, gives a correct δ-approximate φ quantile with high probability, i.e., at least 1 − 1/n. We give a brief description of their technique below.

Gupta and Zane use randomized samplers to maintain each quantile. Suppose we want to maintain the (βi/n)th quantile. We sample each value in the data stream with probability T/βi and keep the smallest T values, where T is a reasonably small number. By doing this, we expect the rank of the largest item in the sampler to be βi, i.e., the (βi/n)th quantile. If we choose T carefully, as a function of δ and n, we can actually ensure that with probability at least 1 − 1/n the largest item in the sampler is a δ-approximation of the quantile. The samplers do, however, need to know n (or an upper bound on n) in advance to ensure this.
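The following is a minimal sketch of one such sampler (ours, not the data structure from [8]; the capacity T = 64 and the uniform pseudo-random source are illustrative choices): each arriving value is kept with probability T/β, only the T smallest retained values survive, and the largest of them serves as the quantile estimate.

#include <stdlib.h>

#define T 64   /* sampler capacity; an illustrative value */

typedef struct {
    double kept[T];   /* the T smallest sampled values seen so far */
    int    size;
    double rate;      /* sampling probability T / beta             */
} sampler_t;

static void sampler_init(sampler_t *s, double beta)
{
    s->size = 0;
    s->rate = (double)T / beta;   /* beta = rank of the tracked quantile */
}

/* Offer one stream value to the sampler. */
static void sampler_offer(sampler_t *s, double x)
{
    if ((double)rand() / RAND_MAX >= s->rate)
        return;                               /* value not sampled       */
    if (s->size < T) {
        s->kept[s->size++] = x;
        return;
    }
    int imax = 0;                             /* replace current maximum */
    for (int i = 1; i < T; i++)
        if (s->kept[i] > s->kept[imax]) imax = i;
    if (x < s->kept[imax]) s->kept[imax] = x;
}

/* The largest retained value approximates the tracked quantile. */
static double sampler_query(const sampler_t *s)
{
    if (s->size == 0) return 0.0;             /* nothing sampled yet */
    double m = s->kept[0];
    for (int i = 1; i < s->size; i++)
        if (s->kept[i] > m) m = s->kept[i];
    return m;
}

In the actual scheme T is chosen as a function of δ and n so that the failure probability is at most 1/n; the linear scans here merely keep the sketch short.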

To compute information gain we simply have such a sampler for each quantile for each feature-class pair. Using approximate quantiles, instead of precise ones, increases the relative error by just a small constant factor. Thus the information gain can be computed in a stream taking O(cd · log² n / ε³) space.

3 Experimental Set-up

We ran various experiments to check the robustness of our approach. As we mentioned, we are interested in evaluating not only the precision of the information gain computation but also the memory savings in using the samplers that reflect the statistics and distribution of the data. We ran two different variations of experiments: in one we randomly selected the initial sample and then uniformly selected same-size samples from the data; in the other we purposely skewed the distribution of the data, such that the class distribution in each stream is different from the previous stream and even very different from the actual class distribution in the data. We did the latter by sorting all the examples by a feature value and then selecting samples to constitute the stream. We applied the former approach on all the datasets, while we applied the latter only on the artificial dataset as a proof of concept. We are in the process of including additional datasets in the study. We benchmark our streaming approach against regular equal interval width binning and (offline) precise computation of information gain using all the data.

We also used different values of ε in our experiments. A lower value of ε results in a more accurate computation but requires more memory. As ε increases, accuracy and memory usage both decrease. Both the size and the number of samplers used decrease as ε increases, and since the samplers are holding fewer values, the quality of the approximation decreases. We thus used ε ∈ {0.25, 0.5, 0.75}.

3.1 Datasets

We used four datasets, as shown in Table 1, in our paper. Each one has varying characteristics; three of the datasets are real-world while one is artificial. The artificial dataset has 1 million examples and 2 classes with a balanced distribution. The independent features in the artificial dataset are drawn from N(0, 1). Random noise drawn from N(0, 0.1) is added to all the features. The features are then rescaled and shifted randomly. The relevant features are centered and rescaled to a standard deviation of 1. The class labels are then assigned according to a linear classification from a random weight vector, drawn from N(0, 1), using only the useful features, centered about their mean. The Can dataset was generated from the Can ExodusII data using the AVATAR [3] version of the Mustafa Visualization tool. The portion of the can being crushed was marked as "very interesting" and the rest of the can was marked as "unknown." A dataset of 443,872 samples, with 8,360 samples marked as "very interesting," was generated. The Covtype and Letter datasets are from the UCI repository [10]. The Covtype dataset has 7 classes with a highly imbalanced distribution; we selected the 10 continuous features from the actual 54 features in the Covtype data (the remaining 44 features are all binary). The Letter dataset has 26 classes that are fairly balanced.
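As a minimal sketch of how one example of such an artificial dataset could be generated (our reconstruction of the description above, with made-up constants; the random rescaling and shifting steps are omitted, and the N(0, 0.1) noise is taken here as a standard deviation of 0.1):

#include <math.h>
#include <stdlib.h>

#define D  6                         /* number of features, as in Table 1 */
#define PI 3.14159265358979323846

/* Standard normal deviate via the Box-Muller transform. */
static double gauss(void)
{
    double u1 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    double u2 = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
    return sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

/* Generate one example: features x[0..D-1] and its binary class label,
 * assigned by the sign of a linear function with weight vector w.       */
static int make_example(const double w[D], double x[D])
{
    double score = 0.0;
    for (int i = 0; i < D; i++) {
        x[i] = gauss() + 0.1 * gauss();   /* N(0,1) feature plus noise */
        score += w[i] * x[i];             /* random linear classifier  */
    }
    return score >= 0.0 ? 1 : 0;
}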

Dataset      Number of Examples   Number of Classes   Number of Features
Artificial   1,000,000            2                   6
Covtype      581,012              7                   10
Can          443,872              2                   9
Letter       20,000               26                  16

Table 1. Datasets.

We used the following variations for each dataset:

• Artificial Dataset: We randomly selected examples from the training set to constitute each stream. To evaluate biased and changing data distributions, we implemented two scenarios. In the first scenario, we sorted the examples by a feature and then sampled from the resulting data distribution. In the second scenario, we changed the class distribution in the different chunks of streaming data. The original data distribution is 50:50. However, we set up streams of size 1,000 such that the class sizes alternate between 10:990 and 990:10 every 200,000 examples (a small sketch of this sampling scheme follows this list). The comparison benchmark for both is the precise information gain calculation that assumes every data point seen so far is stored.

• Covtype Dataset: We randomly selected examples from the training set to constitute each stream. This maintained the original distribution.

• Can Dataset: We randomly selected examples from the training set to constitute each stream. This maintained the original distribution.

• Letter Dataset: We randomly selected examples from the training set to constitute each stream. This maintained the original distribution.
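Below is a small sketch of the alternating class-ratio sampling mentioned in the Artificial Dataset item (our illustration of the setup described above, not the generator actually used; the function name and the use of rand() are assumptions):

#include <stdlib.h>

/* Decide the class label for the example at a given stream position:
 * within each block of 200,000 examples the minority/majority roles
 * swap, and every 1,000 examples are drawn with a 10:990 split.       */
static int draw_label(long position)
{
    int swapped  = (position / 200000) % 2;   /* which class is the minority */
    int minority = (rand() % 1000) < 10;      /* 10 out of every 1,000        */
    return swapped ? (minority ? 1 : 0) : (minority ? 0 : 1);
}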

3.2 Results

Tables 2 to 6 show the information gain obtained by using our proposed quantile method, equal interval width binning, and the precise offline calculation. The features in the tables are sorted in decreasing order of information gain for the precise computation. For both our method and equal interval width binning, we calculated information gain on the streaming data. Each entry in the tables is the final information gain value once all the available data has streamed in. As is evident from the tables, if the streams are uniformly randomly sampled from the available data, then both equal interval width binning and our method achieve information gain values for the features similar to those of the precise calculation. The key point is that the overall ranking of features is maintained. However, once we artificially changed the distribution of the data, by sorting on a feature value, the equal interval width binning computation of information gain deteriorated significantly in its performance. The quantile-based method for computing information gain still performs fairly well, and provides the same ranking of the features as the precise method. Note that while the actual information gain is not exactly the same as in the precise calculation, it is the ranking of features that is most relevant for feature selection. So, in the end, both techniques will select the same top-ranked features, and the information gain values are highly comparable.

Figures 1 to 4 show the information gain trend for the best feature for all the datasets along with the specified assumptions, where the best feature was chosen from the complete offline evaluation. We wanted to evaluate the impact of streaming data on the best feature's value. The benchmark for our approach is the precise computation, which assumes that every data element is stored as the stream arrives, thus requiring more memory. We expected the best feature to be the most sensitive to changes in the distribution of streaming data, as all the feature values and corresponding classes are not well represented in the first few segments. As is evident from the figures, the quantile-based approximation approach for all three ε values considered closely follows the trend of the precise computation. We noted in the tables that the final feature rankings are the same as the precise ones, and the values are very close. The figures establish that the trend matches the precise computation. Figure 5 shows the information gain computation for the changing data distribution on the artificial dataset. Again, the quantile based method closely follows the precise calculation.

Figures 6 to 9 show the memory requirement of the quantile based approach as compared to the precise one. Examining the memory requirements of calculating information gain precisely and with approximate quantiles for different values of ε provides insight into the scalability of our method. While using the approximate quantiles with ε = 0.25 requires significantly more memory than calculating information gain precisely for all four data sets, using approximate quantiles with ε = 0.5 or 0.75 requires significantly less memory for the larger data sets. We expect the approximate quantile based approach to be more beneficial when the datasets are very large, on the order of millions of records, as demonstrated by the artificial dataset. It is remarkable that the memory required by the quantile method is just 1/10th of the total memory requirement of the precise method for the much larger artificial dataset. We observe an order of magnitude savings for both the Covtype and Can datasets, which are moderately large for a streaming scenario. For the smaller Letter dataset, with only 20K examples, the samplers require more memory to keep all the relevant statistics than would be required for keeping the entire dataset. This is due to the fact that there are so many classes relative to the number of examples. There are not many more examples for each class than there are locations in the zeroth samplers, so the memory used to store values in the other samplers is wasted. Moreover, if the datasets are indeed that small, one can easily store the entire dataset and even recompute the information gain or re-build a classifier as the new stream arrives.

precise   ε = 0.25   ε = 0.5   ε = 0.75   binning
0.4378    0.4474     0.4365    0.4378     0.4376
0.0499    0.0577     0.0568    0.0499     0.0499
0.0052    0.0096     0.0120    0.0052     0.0052
0.0013    0.0026     0.0032    0.0013     0.0013
0.0004    0.0022     0.0022    0.0004     0.0004
0.0000    0.0009     0.0008    0.0000     0.0000

Table 2. Information gain of all features in the artificial data set.

precise   ε = 0.75   binning
0.4378    0.4558     0.0000
0.0499    0.0724     0.0498
0.0052    0.0243     0.0052
0.0013    0.0243     0.0013
0.0004    0.0174     0.0004

Table 3. Information gain of all features in the artificial data set when examples appear in increasing order of the most significant feature.

precise   ε = 0.5   ε = 0.75   binning
0.2989    0.3011    0.3143     0.2986
0.0820    0.0866    0.0850     0.0819
0.0613    0.0622    0.0651     0.0612
0.0380    0.0368    0.0456     0.0379
0.0217    0.0248    0.0279     0.0217
0.0156    0.0189    0.0220     0.0156
0.0150    0.0166    0.0186     0.0150
0.0128    0.0161    0.0156     0.0128
0.0089    0.0127    0.0181     0.0089
0.0079    0.0140    0.0159     0.0079

Table 4. Information gain of all features in the covtype data set.

precise   ε = 0.25   ε = 0.5   ε = 0.75   binning
0.3966    0.3966     0.3990    0.4014     0.3971
0.3824    0.3824     0.3805    0.3864     0.3828
0.3721    0.3721     0.3702    0.3661     0.3729
0.3706    0.3706     0.3709    0.3693     0.3711
0.3387    0.3397     0.3374    0.3320     0.3397
0.2930    0.2920     0.2916    0.2861     0.2928
0.2811    0.2811     0.2862    0.2836     0.2812
0.2530    0.2530     0.2494    0.2451     0.2536
0.2169    0.2169     0.2111    0.1927     0.2171
0.2006    0.2006     0.1976    0.1995     0.2005
0.1970    0.1970     0.2032    0.2029     0.1979
0.0693    0.0693     0.0719    0.0723     0.0690
0.0501    0.0501     0.0519    0.0620     0.0500
0.0480    0.0480     0.0534    0.0554     0.0478
0.0341    0.0341     0.0340    0.0346     0.0344
0.0038    0.0038     0.0038    0.0042     0.0038

Table 5. Information gain of all features in the letter data set.

precise   ε = 0.25   ε = 0.5   ε = 0.75   binning
0.0132    0.0137     0.0158    0.0124     0.0131
0.0129    0.0127     0.0150    0.0136     0.0129
0.0109    0.0111     0.0112    0.0115     0.0109
0.0035    0.0040     0.0043    0.0060     0.0035
0.0029    0.0032     0.0040    0.0034     0.0029
0.0020    0.0021     0.0029    0.0030     0.0020
0.0013    0.0016     0.0017    0.0019     0.0013
0.0012    0.0014     0.0017    0.0024     0.0012
0.0006    0.0007     0.0009    0.0011     0.0006

Table 6. Information gain of all features in the can data set.

[Figure 1. Information gain trend for the best feature in the Artificial dataset. Plot of information gain (0–0.5) versus number of examples (100,000–1,000,000), sorted by the most significant feature, for precise, ε = 0.25, ε = 0.5, ε = 0.75, and binning.]

[Figure 2. Information gain trend for the best feature in the Covtype dataset. Plot of information gain (0–0.5) versus number of examples (0–600,000) for precise, ε = 0.25, ε = 0.5, ε = 0.75, and binning.]

[Figure 3. Information gain trend for the best feature in the Can dataset. Plot of information gain (0–0.5) versus number of examples (0–450,000) for precise, ε = 0.25, ε = 0.5, ε = 0.75, and binning.]

[Figure 4. Information gain trend for the best feature in the Letter dataset. Plot of information gain (0–0.5) versus number of examples (2,000–20,000) for precise, ε = 0.25, ε = 0.5, ε = 0.75, and binning.]

[Figure 5. Information gain values for the best feature on the artificial dataset, where the class distribution changes over the streaming data. Plot of information gain (0–0.5) versus number of examples (100,000–1,000,000) for precise, ε = 0.25, ε = 0.5, ε = 0.75, and binning.]

[Figure 6. Memory utilized for information gain computation on the Artificial dataset. Plot of memory usage versus number of examples (100,000–1,000,000) for precise, ε = 0.25, ε = 0.5, ε = 0.75, and equal width binning.]

[Figure 7. Memory utilized for information gain computation on the Covtype dataset. Plot of memory used (KB) versus number of examples (0–600,000) for precise, ε = 0.25, ε = 0.5, ε = 0.75, and binning.]

[Figure 8. Memory utilized for information gain computation on the Can dataset. Plot of memory used (KB) versus number of examples (0–450,000) for precise, ε = 0.25, ε = 0.5, ε = 0.75, and binning.]

[Figure 9. Memory utilized for information gain computation on the Letter dataset. Plot of memory used (KB) versus number of examples (2,000–20,000) for precise, ε = 0.25, ε = 0.5, ε = 0.75, and binning.]

4 Conclusions

We have designed a randomized solution to computing information gain in general data streams, without making any assumptions on the underlying distributions or domains. The approach uses an original discretization technique based on quantiles. The essential idea is to use finer bins where the relative error due to discretization can be large. We tie this technique to a known method for computing approximate quantiles in data streams and obtain a time and memory efficient solution that has strict error bounds.

We have demonstrated the accuracy, memory efficiency, and robustness of the solution using a variety of datasets. We show that its memory usage is much lower than the corresponding usage by a regular precise computation, and its accuracy much better than that of an approach using equal interval width binning (which, in fact, completely breaks down if the distribution in the data stream abruptly changes). Based on our theoretical and empirical analysis it is clear that our algorithm's memory efficiency, relative to a regular computation, will be even more dramatic for larger datasets (streams that consist of tens of millions to billions of examples). The algorithm's robustness has been amply demonstrated in the tests that simulate sudden changes in distributions. We thus conclude that it is a practical solution to computing information gain in general data streams.

As a part of our ongoing research, we are testing larger datasets with a variety of dataset distributions, including ones in which the classes can appear and disappear arbitrarily in the stream. Later, we would like to extend the algorithm to build a decision tree in data streams. We hope to demonstrate the practicality of the solution for any statistical or machine learning method that requires discretization of continuous features, be they spatial, temporal, or otherwise.

Acknowledgements

We are grateful to Philip Kegelmeyer and Sandia National Labs for providing the Can visualization dataset.

References

[1] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. Models and issues in data stream systems. In Proc. of the 21st ACM Symposium on Principles of Database Systems (PODS 2002), 2002.

[2] J. Catlett. On changing continuous attributes into ordered discrete attributes. In Proceedings of the European Working Session on Learning, Springer-Verlag, pages 164–178, 1991.

[3] N. V. Chawla and L. O. Hall. Modifying MUSTAFA to capture salient data. Technical Report ISL-99-01, University of South Florida, Computer Science and Eng. Dept., 1999.

[4] P. Domingos and G. Hulten. Mining high-speed data streams. In Knowledge Discovery and Data Mining, pages 71–80, 2000.

[5] James Dougherty, Ron Kohavi, and Mehran Sahami. Supervised and unsupervised discretization of continuous features. In International Conference on Machine Learning, pages 194–202, 1995.

[6] U. Fayyad and K. Irani. Multi-interval discretization of continuous-valued attributes for classification learning. In Proceedings of the 13th International Joint Conference on Artificial Intelligence, pages 1022–1027, 1993.

[7] Johannes Gehrke, Venkatesh Ganti, Raghu Ramakrishnan, and Wei-Yin Loh. BOAT — optimistic decision tree construction. In ACM SIGMOD International Conference on Management of Data, pages 169–180, 1999.

[8] Anupam Gupta and Francis X. Zane. Counting inversions in lists. In SODA '03: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pages 253–254. Society for Industrial and Applied Mathematics, 2003.

[9] Geoff Hulten, Laurie Spencer, and Pedro Domingos. Mining time-changing data streams. In KDD, pages 97–106, 2001.

[10] C. J. Merz and P. M. Murphy. UCI repository of machine learning databases. Univ. of California, Dept. of CIS, Irvine, CA. Machine readable data repository, http://www.ics.uci.edu/~mlearn/MLRepository.html.

[11] I. Munro and M. Paterson. Selection and sorting with limited storage. In Proc. IEEE FOCS, 1980.

[12] J. R. Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1992.

[13] W. N. Street and Y. Kim. A streaming ensemble algorithm (SEA) for large-scale classification. In F. Provost and R. Srikant, editors, KDD '01, pages 377–382, San Francisco, CA, 2001.

[14] H. Wang, W. Fan, P. Yu, and J. Han. Mining concept-drifting data streams using ensemble classifiers. In ACM SIGKDD Conference, 2003.

Stream Mining for Network Management

Kenichi Yoshida
University of Tsukuba
[email protected]

Satoshi Katsuno, Shigehiro Ano, Katsuyuki Yamazaki
KDDI R&D Laboratories Inc.

Masato Tsuru
Kyushu Institute of Technology

Abstract

Network management is an important issue in maintaining the Internet as an important social infrastructure. In particular, finding excessive consumption of network bandwidth caused by P2P mass flows is important. Finding Internet viruses is also an important security issue. Although stream mining techniques seem to be promising techniques for finding P2P and Internet viruses, the vast network flow prevents the simple application of such techniques. A mining technique which works well with extremely limited memory is required. It should also have a real-time analysis capability. In this paper, we propose a cache based mining method to realize such a technique. By analyzing the characteristics of the proposed method with real Internet backbone flow data, we show the advantages of the proposed method, i.e., less memory consumption while realizing real-time analysis capability. We also show that we can use the proposed method to find mass flow information from Internet backbone flow data.

1 Introduction

Network management is an important issue in maintaining the Internet as an important social infrastructure. The two most important issues of network management are the treatment of P2P traffic and of Internet viruses. Since the vast consumption of network bandwidth caused by P2P mass flows is becoming so excessive, a method to find and prevent such flows is an important network management task for keeping the Internet working optimally. To protect Internet security, finding Internet viruses is also an important issue.

Since both P2P applications and Internet viruses are rapidly producing new varieties, automatic methods for finding their new varieties are desired. Stream mining techniques seem to be promising techniques to find new P2P and Internet viruses automatically. However, the vast network flow prevents the simple application of such techniques. A stream mining technique which works well with extremely limited memory is required. It should also have a real-time analysis capability. For example, 10 Gbps of network bandwidth can transfer 100 terabytes of data per day. Since today's Internet backbone has an even broader bandwidth, a mining system has to handle more than 100 terabytes of data per day. Although a large computer with gigabytes of memory can be used, the memory size of such a computer is still extremely small if we compare it to the amount of data to be analyzed. The real-time analysis capability is also indispensable.

In this paper, we propose a cache based mining algorithm. The original concept of the proposed algorithm is the use of a fixed size cache memory to find frequent items. Through the best use of the fixed size cache, we hope to realize a stream mining method which can work well with extremely limited memory resources while realizing real-time analysis capability.

By analyzing the characteristics of the proposed method with real Internet backbone flow data, we show the advantages of the proposed method. We also show that we can use the proposed method to find mass flow information from the Internet backbone flow data.

Section 2 of this paper first surveys related work and determines their limitations in order to clarify the motivation of this research. Section 3 explains our methods, and Section 4 reports on the experimental results. Section 5 examines characteristics of the proposed method. Finally, Section 6 concludes our findings.

2 Related work

Monitoring Internet traffic is an extensively studied area, e.g. [1, 17, 20, 21]. IETF's IPPM working group proposes a framework of IP performance metrics [20]. Their work is important in providing a baseline to compare the measured results by standardizing the attributes to be measured. Surveyor [21] is a project to create measurement infrastructure. NLANR [17] has a project to develop a large-scale data collection system as the base infrastructure for various data analyses. CAIDA [1] is making various tools to analyze network data. Their visualization tools cover various analyses of network data.

Analysis of measured data is also studied [15, 18]. Some studies, e.g. [4], try to use data mining techniques to automate the analysis. Though [4] claims its functionality with restricted memory, further research is necessary.

When considering the importance of data mining performance, various methods for frequent item finding have been proposed. Among them, Coarse counting [7], Sticky sampling, Lossy Counting [13], hash-based approaches [2, 10], and the use of group tests [3] are important methods. These methods can quantify frequently appearing items without any omissions. However, we found that these methods had poor performance when working with limited memory. These methods tend to overestimate the frequencies of occurrence when the available memory is limited.

In this study, we investigate a method which uses a fixed size cache memory to find frequent items. The management of the cache memory significantly affects the performance of our method. The study of cache management has a long history. LRU based methods such as LFU, LRU-k [19], and 2Q [11] form an important family of management strategies. However, extensive study on the use of these methods for frequent item finding has not been reported.

The memory management strategies we used in this study, i.e. random2 and hash2, retain information on multiply accessed entries. Random2 was originally developed in the study of spam filters [23] and its characteristics were reported in [24]. In this paper, we introduce hash2 as an enhancement of random2.

Among conventional studies, CPM [24] and Space-Saving [14] have the best performance with restricted memory. We empirically show the advantage of our new method in Section 4. A comparison with the second best method, i.e. hCount∗ [10], is also reported in this paper. Note that most of the mining methods which work well with limited memory, e.g. CPM, Space-Saving and hCount∗, are off line methods. They tend to lack the ability to handle so called concept drift [22]. The method we propose in this paper has the ability to handle concept drift while retaining performance under limited memory.

Finally, from the viewpoint of flow measurement, there exists a considerable number of works measuring huge traffic on high-speed links such as core routers in the Internet backbone. Note that a flow here is a sequence of packets having the same five-tuple, that is, source/destination addresses, source/destination ports, and protocol. To measure (process) a huge number of packets per unit time on a very high-speed link, packet sampling is often employed, as in NetFlow and sFlow in commodity routers. For retrieving the original flow information from the sampled data, several methods have been proposed for obtaining flow statistics [5, 9] and for counting the frequency (i.e., the number of packets) of each of the frequently appearing flows, often called elephant flows [16]. On the other hand, to cope with the memory limitation of measurement systems recording the per-flow information of a huge number of flows, the use of a kind of irreversible compression of information by a Bloom filter or its extension (the space-code Bloom filter) has also been proposed, for counting only elephant flows [6] or for roughly counting all flows in a multi-resolution way [12]. However, all these methods generally suffer from the overhead of complex off-line processing and the difficulty of finding appropriate parameters for achieving a reasonable counting accuracy, due to the nature of statistical inference from sampled and/or compressed information.

In contrast with them, the method proposed in this paper is very light-weight and memory-efficient because it just counts the frequency of each of the appearing flows in a fixed-size table with a novel table-entry replacement strategy, while it can find and count elephant flows in an on-line manner with a reasonable accuracy. Note that a sliding window-based on-line method for counting elephant flows has been proposed [8], but it is still complex compared with our proposed method.

3 CPM-Stream & Hash2

We have investigated an off line version of CPM (Cache-based Pattern Mining) which uses a fixed size cache memory to find frequent items [24]. Figure 1 shows its algorithm. It simply counts the frequency of items. A fixed size cache is used to store the frequencies.

If the size of the cache is large enough, calculating the index of a cache entry for an item is simple (Fig. 1, 5th line). A standard hash technique can be used to find the index for a recently encountered item.1 Free entries can be used for newly encountered items. When memory is restricted, cache entries have to be reused by deleting old entries in the cache memory. How an entry is selected for deletion significantly affects the performance of CPM. The memory management strategy Random2 (Figure 2) is the strategy CPM uses in such cases.

Among conventional studies, CPM and Space-Saving show the best performance when available memory is limited [24, 14]. However, both CPM and Space-Saving are essentially off line algorithms and cannot handle so called concept drift [22]. For example, when a user tries to transfer data using P2P software, a P2P flow starts at some point in time, and ends after it has transferred the intended data. During that period, the packets of this P2P traffic appear frequently. However, after the P2P transfer completes, they are no longer frequent.

Since the original CPM cannot process this, we have developed an on-line version.

1 This requires an auxiliary hash table. The need for this hash table is omitted in hash2. See later in this Section.

TDM 2005: 2005 Temporal Data Mining Workshop 83

Page 84: Temporal data mining: algorithms, theory and applications (TDM …cs.stmarys.ca/~pawan/icdm05/proceedings/TDM-proce… ·  · 2005-11-07Temporal data mining: algorithms, theory and

Algorithm CPM
begin
    Create empty heap;
    while (input item) do
        i = index of item in heap;
        increment heap_cnt[i] by 1;
        if (heap_cnt[i] > threshold)
            print message;
    done
end

Figure 1. Algorithm of CPM

int i = random() % HEAP;
for (p = 1; (heap_cnt[i] > p); p++)
    i = random() % HEAP;   /* re-pick while the chosen entry's count exceeds the growing threshold p */

Figure 2. C program code of Random2

Figure 3 shows the on-line version of CPM, named CPM-Stream, and Figure 4 shows its memory management strategy, hash2.

To handle concept drift, CPM-Stream randomly decreases one counter when it increments another counter for a new item (see the counter-decrement lines in Figure 3). By doing this, CPM-Stream handles concept drift. Even if the frequency of some item is large, it gradually becomes small as long as the item does not appear again.

We also enhance the memory management strategy. Hash2 (see Figure 4 for the pseudo code) is an enhancement of random2. It first calculates N hash values of a given item; N hash functions are used for this purpose. Next it generates N indexes from the N hash values. Then hash2 selects the index which refers to the least frequent entry out of the N entries referred to by the N indexes. We used N = 4 in the experiments reported in the next Section. Although we did not extensively seek the best N, 4 tends to give reasonably good results in various experiments.

Algorithm CPM-Stream
begin
    Create empty heap;
    while (input item) do
        i = index of item in heap;
        increment heap_cnt[i] by 1;
        if (heap_cnt[i] > threshold)
            print message;
        i = randomly select heap element;
        decrement heap_cnt[i] by 1;
    done
end

Figure 3. Algorithm of CPM-Stream

Function Hash2
Input
    Item: Data to be stored in Cache
Variable
    Hash[]: Table of Hash Values
    Idx[]: Table of Cache Index
begin
    Calculate N hash values from Item and store them into Hash[]
    Idx[] = Hash[] % Cache Size
    return Idx that refers to the least frequent entry
end

Figure 4. Pseudo code of Hash2

Although both hash2 and random2 have a mechanism to implement the "Retaining multiply accessed entries" strategy, they select the entries to be deleted randomly. A random function is used by random2. Hash2 uses a hash function as a substitute for the random function. While random2 needs an auxiliary hash table to find the index for a recently encountered item, hash2 does not require such a table. Hash2 can find the index for a recently encountered item by calculating its hash values. Thus, the memory efficiency of hash2 is slightly improved over random2.
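To make the combination of CPM-Stream and hash2 concrete, here is a compact sketch (ours, not the authors' implementation; the table size, the multiplicative hash functions, N = 4, and the threshold handling are illustrative): compute N candidate slots, keep or create the item in the least frequent one, and randomly decrement some other counter so that stale flows decay.

#include <stdio.h>
#include <stdlib.h>

#define HEAP  (1 << 16)   /* number of cache entries (illustrative)   */
#define NHASH 4           /* candidate slots per item, as in hash2    */

typedef struct { unsigned long key; long cnt; } entry_t;
static entry_t heap[HEAP];

/* Stand-in for the N independent hash functions assumed by hash2. */
static unsigned slot(unsigned long key, int j)
{
    return (unsigned)((key * 2654435761UL + (unsigned long)j * 40503UL) % HEAP);
}

/* One CPM-Stream step: count `key`, report it when it looks frequent,
 * then decrement a random counter so that inactive items fade away.   */
static void cpm_stream_update(unsigned long key, long threshold)
{
    unsigned best = slot(key, 0);
    for (int j = 0; j < NHASH; j++) {
        unsigned s = slot(key, j);
        if (heap[s].key == key) { best = s; break; }   /* already cached */
        if (heap[s].cnt < heap[best].cnt) best = s;    /* least frequent */
    }
    if (heap[best].key != key) {                       /* reuse the slot */
        heap[best].key = key;
        heap[best].cnt = 0;
    }
    if (++heap[best].cnt > threshold)
        printf("frequent item: %lu (count %ld)\n", key, heap[best].cnt);

    unsigned victim = (unsigned)rand() % HEAP;         /* concept drift: */
    if (heap[victim].cnt > 0) heap[victim].cnt--;      /* random decay   */
}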

4 Experimental Results

In this section, we analyze an IP header log with CPM-Stream and other methods. The IP header log was recorded at a monitoring point of a commercial Internet backbone, and is a collection of MD5 values covering 164 million IP packets. Only the source IP address, destination IP address, source port number, and destination port number are used to calculate the MD5 values. Since the MD5 values and the original data, i.e., the set of IP addresses and port numbers, have a one-to-one correspondence, finding frequently appearing MD5 values in this data means finding frequent/mass flows caused by P2P traffic.
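As a rough sketch of how such flow identifiers could be produced (not the authors' tooling; the struct layout is an assumption, and any MD5 routine would do — OpenSSL's MD5() is used here only for illustration):

#include <stdint.h>
#include <string.h>
#include <openssl/md5.h>   /* MD5(); deprecated in OpenSSL 3.0 but still available */

/* The four header fields hashed into one flow identifier. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
};

/* Compute the 16-byte MD5 digest used as the flow's item identifier. */
static void flow_digest(const struct flow_key *k,
                        unsigned char out[MD5_DIGEST_LENGTH])
{
    unsigned char buf[12];
    memcpy(buf,      &k->src_ip,   4);
    memcpy(buf + 4,  &k->dst_ip,   4);
    memcpy(buf + 8,  &k->src_port, 2);
    memcpy(buf + 10, &k->dst_port, 2);
    MD5(buf, sizeof buf, out);
}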

Because of the file system limitation of the operating system we used, the IP header log is stored in multiple files. The size of each file is 2 Gbytes. The first file stores the packets of the first period. The second file stores the succeeding period, and so on. About 900 sub-files are used to store the entire IP header log.

4.1 Concept Drift in Internet Data

We first check how the frequencies of flows change. To do this, we make a list of MD5 values which appear more than 1,000 times in the 450th file.

[Figure 5. Frequency Drift of IP Flows. Plot of the number of found flows (0–500) versus file number (0–900).]

Next we check how many of them also appear in the other files. Here a PC with sufficient memory to hold all the information was used. Figure 5 shows the results. The vertical axis shows how many of the MD5 values that appear in the 450th file also appear in the other file. The horizontal axis is the file number and represents the time sequence.

As clearly shown in the figure, the frequencies of flows are changing. As the difference between the file numbers becomes large, i.e. as the difference between the files' creation times becomes large, the number of flows found in both files becomes small. This is due to the drift of flow frequencies. Thus stream mining methods have to handle concept drift to analyze this drift of flow frequencies.

Note that in each of the files which store IP header information, the packets of the top 1,000 flows represent about 50% of all of the packets. Thus finding the top 1,000 flows can contribute to finding mass flows.

4.2 Comparison of Memory Efficiency

As we explained before, performance under limited memory resources is important in applying mining methods to analyze network data. Among conventional studies, the original CPM [24] has the best performance under limited memory conditions. Recently, [14] reported Space-Saving, which has similar memory performance with a theoretical upper bound on memory usage. hCount∗ [10] has the second best performance. Figure 6 compares CPM-Stream, the original CPM (i.e. the off line version of CPM), and hCount∗.

In this experiment, we measured the performance of each method by changing the memory size, i.e. the number of cache entries. The entire IP header log data was used and the flows whose packets appear more than 10,000 times are marked as frequent. CPM-Stream and CPM make underestimation errors: they both underestimate the frequency of data under limited memory conditions.

[Figure 6. Comparison of Memory Efficiency. Plot of the number of errors (0–2,500) versus the number of cache entries (100 to 1e+09, log scale) for hCount*, CPM, and CPM-Stream.]

hCount∗ makes overestimation errors: it overestimates the frequency of data under limited memory conditions. Figure 6 compares the number of errors generated by these methods.

As shown in the figure, CPM is always best and CPM-Stream is next. hCount∗ is the worst among these three methods. The performance degradation of CPM-Stream relative to CPM is due to the handling of concept drift. The random decrease of the counter (the counter-decrement lines in Figure 3) causes the performance degradation. However, as described in the next sub-section, this performance degradation is compensated for when the handling of concept drift is necessary.

4.3 Effect of Concept Drift

Since we store the IP header log in multiple files, we can find mass flows by analyzing each file using an off line data mining program such as CPM or hCount∗. Figure 7 shows a problem of such off line analysis. Since the occurrences of some flows are stored in multiple files separately, off line analysis underestimates the frequency of such flows. For example, off line analysis underestimates the frequency of flow A in Figure 7. Off line analysis can only estimate the frequency of flow B in this case.

Figures 8, 9, and 10 show the results obtained by CPM-Stream using different cache sizes (6,400, 25,600, and 102,400 entries respectively). The figures show the number of found flows with CPM-Stream and the difference in the results between CPM-Stream and off line CPM. Here CPM used enough memory to find all frequent flows. CPM-Stream continuously inputs the multiple files in time sequence. Thus the flows only found by off line CPM are due to insufficient memory, and the flows only found by the on line CPM-Stream are due to the phenomenon shown in Figure 7.

As shown in the figures, the number of flows only found by off line CPM decreases as the memory size of CPM-Stream increases.

[Figure 7. Effect of Online Analysis. Flows A and B drawn across files N, N+1, and N+2.]

[Figure 8. Results with 6400 Cache Entries. Plot of the number of found flows (0–800) versus file number (0–900): found flows, flows found only in off line analysis, flows found only in on line analysis.]

[Figure 9. Results with 25600 Cache Entries. Plot of the number of found flows (0–800) versus file number (0–900): found flows, flows found only in off line analysis, flows found only in on line analysis.]

[Figure 10. Results with 102400 Cache Entries. Plot of the number of found flows (0–800) versus file number (0–900): found flows, flows found only in off line analysis, flows found only in on line analysis.]

The number of flows only found by CPM-Stream increases as the memory size of CPM-Stream increases. With 25,600 cache entries, the number of flows only found by off line CPM and the number of flows only found by CPM-Stream are roughly equal. With more than 25,600 cache entries, the number of flows found by CPM-Stream exceeds that found by off line CPM due to the proper handling of concept drift.

As mentioned in Section 4.2, the handling of concept drift slightly decreases the memory efficiency of CPM-Stream compared with the off line version of CPM. However, if the target data requires the analysis of concept drift, the proper handling of concept drift compensates for the memory efficiency degradation.

5 Discussion

CPM-Stream is a modified version of CPM, which is originally suited to off line analysis. Two important modifications are the handling of concept drift and the replacement of the memory management strategy from random2 to hash2. The importance of concept drift handling was examined in the previous section.

Figure 11 shows how random2 and hash2 select entries to be deleted. Both memory management strategies use the "retaining multiply accessed entries" strategy. They both select less frequently accessed entries as candidates for deletion. To realize this selection, random2 increases a counter ("p" in Figure 2) in its loop. Here the counter acts as a threshold on frequency. When the heap size is small, the frequencies of the entries in the heap tend to be high. By gradually increasing the counter, random2 tries to select less frequent entries in the heap. Hash2 selects the least frequent entry out of randomly selected entries. This statistically selects less frequent entries in the heap.

Another important characteristic of random2 and hash2 is that they both retain newly encountered data for a certain period.

[Figure 11. Selection of Random2 & Hash2. Sketch of item frequencies in the cache: random2 searches and hash2 selects among entries, and both select a less frequently accessed item.]

When CPM and CPM-Stream encounter new data, the frequency counter of the new data is set to one. While the counter is small, such entries are candidates to be deleted if random2 or hash2 picks them. However, since both random2 and hash2 randomly select candidates for deletion, the entry for the newly encountered data gets a postponement. If the newly encountered data is frequent, the counter of the data will increase rapidly. Thus random2 and hash2 will not delete it, and CPM and CPM-Stream can find new frequent data.

We believe that we can realize different implementations of the "Retaining multiply accessed entries" strategy. Random2 and hash2 are the first attempts. However, the two characteristics discussed above are important for designing new implementations of the "Retaining multiply accessed entries" strategy.

Another important issue related to memory management is how to confirm the sufficiency of the memory size used by CPM-Stream. CPM-Stream does not guarantee the quality of its analysis. If the memory size that CPM-Stream uses is too small, the results will miss many frequent items. However, we can check its quality by comparing the result of CPM-Stream with that of the off line version of CPM, as shown in Figures 8, 9, and 10. Note that the off line analysis of CPM has to analyze the data which is monitored in some time period. Since the data monitored in some time period is far smaller than the total data, we can use enough memory to find all the frequent items. Thus we can check the quality of CPM-Stream results by comparing them with the results of the off line analysis.

6 Conclusion

Network management is an important issue in maintaining the Internet as an important social infrastructure. In particular, finding excessive consumption of network bandwidth caused by P2P mass flows is important. In this paper, we have shown, with experimental results, how we can use a stream mining technique to analyze P2P mass flows. The characteristics of our approach are:

• A mining technique which works well with extremely limited memory

• The handling of concept drift to capture the current mass flow

• The "Retaining multiply accessed entries" strategy to best utilize the available memory resources

The effect of the handling of concept drift and the behavior of the "Retaining multiply accessed entries" strategy, i.e. random2 and hash2, are also discussed with experiments which use actual Internet traffic. The experiments also show that we can use the proposed approach to find mass flows from Internet backbone flow data.

References

[1] http://www.caida.org/.

[2] M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams, 2002.

[3] G. Cormode and S. Muthukrishnan. What's hot and what's not: Tracking frequent items dynamically. In Proceedings of Principles of Database Systems, pages 296–306, 2003.

[4] E. D. Demaine, A. López-Ortiz, and J. I. Munro. Frequency estimation of internet packet streams with limited space. In Proc. of the 10th Annual European Symposium on Algorithms, 2002.

[5] N. Duffield, C. Lund, and M. Thorup. Estimating flow distributions from sampled flow statistics. In Proc. ACM SIGCOMM, pages 325–336, Karlsruhe, Germany, Aug. 2003.

[6] C. Estan and G. Varghese. New directions in traffic measurement and accounting. In Proc. ACM SIGCOMM, pages 323–336, Pittsburgh, Aug. 2002.

[7] M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. In Proc. 24th Int. Conf. Very Large Data Bases, VLDB, pages 299–310, 1998.

[8] L. Golab, D. DeHaan, E. Demaine, A. Lopez-Ortiz, and J. I. Munro. Identifying frequent items in sliding windows over on-line packet streams. In Proc. ACM SIGCOMM Internet Measurement Conference, Miami, USA, Oct. 2003.

[9] N. Hohn and D. Veitch. Inverting sampled traffic. In Proc. ACM SIGCOMM Internet Measurement Conference, Miami, USA, Oct. 2003.

[10] C. Jin, W. Qian, C. Sha, J. X. Yu, and A. Zhou. Dynamically maintaining frequent items over a data stream. In Proceedings of the twelfth international conference on Information and knowledge management, pages 287–294, 2003.

[11] T. Johnson and D. Shasha. 2Q: a low overhead high performance buffer management replacement algorithm. In Proceedings of the Twentieth International Conference on Very Large Databases, pages 439–450, Santiago, Chile, 1994.

[12] A. Kumar, J. Xu, J. Wang, O. Spatscheck, and L. Li. Space-code bloom filter for efficient per-flow traffic measurement. In Proc. IEEE INFOCOM, Hong Kong, Mar. 2004.

[13] G. Manku and R. Motwani. Approximate frequency counts over data streams. In Proceedings of the 28th International Conference on Very Large Data Bases, Hong Kong, China, pages 346–357, 2002.

[14] A. Metwally, D. Agrawal, and A. E. Abbadi. Efficient computation of frequent and top-k elements in data streams. In T. Eiter and L. Libkin, editors, ICDT, volume 3363 of Lecture Notes in Computer Science, pages 398–412. Springer, 2005.

[15] J. Mirkovic, G. Prier, and P. L. Reiher. Attacking DDoS at the source. In Proc. of the 10th IEEE International Conference on Network Protocols, pages 312–321, 2002.

[16] T. Mori, M. Uchida, R. Kawahara, J. Pan, and S. Goto. Identifying elephant flows through periodically sampled packets. In Proc. ACM SIGCOMM Internet Measurement Conference, Taormina, Sicily, Italy, Oct. 2004.

[17] http://moat.nlanr.net/.

[18] Y. Ohsita, S. Ata, M. Murata, and T. Murase. Detecting distributed denial-of-service attacks by analyzing TCP SYN packets statistically. In Proc. of IEEE Globecom 2004, 2004.

[19] E. J. O'Neil, P. E. O'Neil, and G. Weikum. The LRU-K page replacement algorithm for database disk buffering. In Proc. ACM SIGMOD International Conference on Management of Data, pages 297–306, 1993.

[20] RFC 2330, Framework for IP performance metrics.

[21] http://www.advanced.org/surveyor/.

[22] G. Widmer and M. Kubat. Learning in the presence of concept drift and hidden contexts. Machine Learning, 23(1):69–101, 1996.

[23] K. Yoshida, F. Adachi, T. Washio, H. Motoda, T. Homma, A. Nakashima, H. Fujikawa, and K. Yamazaki. Density-based spam detector. In KDD 2004, pages 486–493, 2004.

[24] K. Yoshida, S. Katsuno, S. Ano, K. Yamazaki, M. Tsuru, and Y. Fujita. Cache-based pattern mining: Basic idea. Submitted to TDM 2005, 2005.

Web Usage Mining: Extracting Unexpected Periods from Web Logs

F. Masseglia, A. Marascu
INRIA Sophia Antipolis
AxIS Project-Team
2004 route des Lucioles
06902 Sophia Antipolis
[email protected]@sophia.inria.fr

P. Poncelet
EMA-LGI2P/Site EERIE
Parc Scientifique Georges Besse
30035 Nîmes Cedex 1
[email protected]

M. Teisseire
LIRMM UMR CNRS 5506
161 Rue Ada
34392 Montpellier Cedex 5
[email protected]

Abstract

Existing Web Usage Mining techniques are currently based on an arbitrary division of the data (e.g. "one log per month") or guided by presumed results (e.g. "what is the customers' behaviour for the period of Christmas purchases?"). Those approaches have two main drawbacks. First, they depend on this arbitrary organization of the data. Second, they cannot automatically extract "seasonal peaks" among the stored data. In this paper, we propose to perform a specific data mining process (and particularly to extract frequent behaviours) in order to automatically discover the densest periods. Our method extracts, among the whole set of possible combinations, the frequent sequential patterns related to the extracted periods. A period will be considered dense if it contains at least one frequent sequential pattern for the set of users connected to the Web site in that period. Our experiments show that the extracted periods are relevant and that our approach is able to extract both frequent sequential patterns and the associated dense periods.

1 Introduction

Analyzing the behaviour of a Web site's users, also known as Web Usage Mining, is a research field which consists in adapting data mining methods to access log file records. These files collect data such as the IP address of the connected machine, the requested URL, the date, and other information regarding the navigation of the user. Web Usage Mining techniques provide knowledge about the behaviour of the users in order to extract relationships from the recorded data [4, 14, 16, 20]. Among the available techniques, sequential patterns [1] are particularly well adapted to the log study.

Extracting sequential patterns from a log file is supposed to provide the following kind of relationship: "On the INRIA Web site, 10% of users visited consecutively the homepage, the available positions page, the ET1 offers, the ET missions and finally the past ET competitive selection".

This kind of behaviour is only supposed to exist, because extracting sequential patterns from a log file means managing several problems (caches and proxies, the great diversity of pages on the site, search engines which allow the user to directly access a specific part of the Web site, etc.). Among those problems, let us focus on the arbitrary division of the data which is done today. This division comes either from an arbitrary decision in order to provide one log per x days (e.g. one log per month), or from a wish to find particular behaviours (e.g. the behaviour of the Web site users from November 15 to December 23, during Christmas purchases). In order to better understand our goal, let us consider student behaviours when they are connected for a working session. Let us assume that these students belong to two different groups of twenty students each. The first group was connected on 31/01/05 while the other one was connected on 01/02/05 (i.e. the second group was connected one day later). During the working session, students have to perform the following navigation: first they access the URL "www-sop.inria.fr/cr/tp accueil.html", then "www-sop.inria.fr/cr/tp1 accueil.html", which will be followed by "www-sop.inria.fr/cr/tp1a.html". Let us consider, as is usual in traditional approaches, that we analyze access logs per month. During January, we can only extract twenty similar behaviours, among 200,000 navigations in the log, sharing the working session. Furthermore, even when considering a range of one month or of one year, this sequence of navigation does not appear sufficiently often in the logs (20/200,000) and will not

1 ET: Engineers, Technicians.

be easy to extract. Let us now consider that we are provided with logs for a very long period (e.g. several years). With the method developed in this article, we can find that there exists at least one dense period in the range [31/01–01/02]. Furthermore, we know that, during this period, 340 users were connected. We are thus provided with the following new knowledge: 11% of users (i.e. 40 of the 340 connected users) visited consecutively the URLs "tp accueil.html", "tp1 accueil.html", and finally "tp1a.html".

Efficient tools are available today [22, 9] for analyzing logs at different levels of granularity (day, month, year). They allow one to know, for instance, how many times the site is accessed or how many requests have been made for each page. Nevertheless, as they depend on the chosen granularity, they suffer from the previously mentioned drawback: they cannot obtain frequent patterns on a very short period because usually such patterns do not appear sufficiently often in the whole log. Close to our problem, [15] proposes to extract episode rules from a long sequence as well as the optimal window size. Nevertheless, our problem is very different since we do not consider that we are provided with a unique long sequence. In our context, i.e. access logs, sequences correspond to different behaviours of users on a Web server. We thus have to manage a very huge set of data sequences and we have to extract both the frequent sequences and the periods where these sequences appear.

The remainder of this paper is organized as follows. Section 2 goes deeper into presenting sequential patterns and how they can be used in Web Usage Mining. In Section 3, we give an overview of Web Usage Mining approaches which are based on sequential patterns. Section 4 presents our motivation for a new approach. Our solution, based on a new heuristic called PERIO, is presented in Section 5. Experiments are reported in Section 6, and Section 7 concludes the paper with future avenues for research.

2 Definitions

In this section we define the sequential pattern mining problem in large databases and give an illustration. Then we explain the goals and techniques of Web Usage Mining with sequential patterns. The sequential pattern mining definitions are those given by [21].

2.1 Sequential Pattern Mining

The problem of mining sequential patterns from static databases is defined as follows [1]:

Definition 1 Let I = {i1, i2, ..., im} be a set of m literals (items). I is a k-itemset, where k is the number of items in I. A sequence is an ordered list of itemsets denoted by < s1 s2 ... sn >, where sj is an itemset. The data-sequence of a customer c is the sequence in D corresponding to c. A sequence < a1 a2 ... an > is a subsequence of another sequence < b1 b2 ... bm > if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.

Client   d1   d2   d3   d4   d5
1        a    c    d    b    c
2        a    c    b    f    c
3        a    g    c    b    c

Figure 1. File obtained after a pre-processing step

Example 1 Let C be a client and S = < (c) (d e) (h) > be that client's purchases. S means that "C bought item c, then he bought d and e at the same moment (i.e. in the same transaction) and finally bought item h".

Definition 2 The support of a sequence s, also called supp(s), is defined as the fraction of total data-sequences that contain s. If supp(s) ≥ minsupp, with a minimum support value minsupp given by the user, s is considered as a frequent sequential pattern.

The problem of sequential pattern mining is thus to find all the frequent sequential patterns, as stated in Definition 2.
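As an illustration of Definitions 1 and 2, the following minimal sketch (in Python) checks the subsequence relation and computes the support of a candidate over a set of data-sequences; it uses the file of Figure 1 as input, and the function names are illustrative rather than taken from the paper.

# Sequences are lists of itemsets (frozensets); names are illustrative.
def is_subsequence(s, b):
    """True if sequence s is a subsequence of sequence b (Definition 1)."""
    i = 0
    for itemset in b:
        if i < len(s) and s[i] <= itemset:   # a_k must be included in some b_{i_k}
            i += 1
    return i == len(s)

def support(s, data_sequences):
    """Fraction of data-sequences containing s (Definition 2)."""
    return sum(1 for d in data_sequences if is_subsequence(s, d)) / len(data_sequences)

# The three data-sequences of Figure 1
db = [
    [frozenset('a'), frozenset('c'), frozenset('d'), frozenset('b'), frozenset('c')],
    [frozenset('a'), frozenset('c'), frozenset('b'), frozenset('f'), frozenset('c')],
    [frozenset('a'), frozenset('g'), frozenset('c'), frozenset('b'), frozenset('c')],
]
pattern = [frozenset('a'), frozenset('c'), frozenset('b'), frozenset('c')]
print(support(pattern, db))   # 1.0, i.e. frequent for minsupp = 100%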

2.2 Access Log File Analysis with Sequential Patterns

The general idea is similar to the principle proposed in [6]. It relies on three main steps. First of all, starting from a rough data file, a pre-processing step is necessary to clean "useless" information. The second step starts from this pre-processed data and applies data mining algorithms to find frequent itemsets or frequent sequential patterns. Finally, the third step aims at helping the user to analyze the results by providing a visualization and request tool.

Raw data is collected in access log files by Web servers. Each input in the log file illustrates a request from a client machine to the server (http daemon). Access log file formats can differ, depending on the system hosting the Web site. For the rest of this presentation we will focus on three fields: the client address, the URL requested by the user, and the time and date of that request. We illustrate these concepts with the access log file format given by the CERN and the NCSA [3], where a log input contains records made of 7 fields, separated by spaces [18]: host user authuser [date:time] "request" status bytes


The access log file is then processed in two steps. First of all, the access log file is sorted by address and by transaction. Then all "uninteresting" data is pruned out from the file.

Definition 3 Let Log be a set of server access log entries. An entry g, g ∈ Log, is a tuple:

g = < ip_g, ([l_1^g.URL, l_1^g.time] ... [l_m^g.URL, l_m^g.time]) >

such that for 1 ≤ k ≤ m, l_k^g.URL is the item asked for by the user g at time l_k^g.time, and for all 1 ≤ j < k, l_k^g.time > l_j^g.time.

The structure of a log file, as described in Definition 3, is close to the "Client-Time-Item" structure used by sequential pattern algorithms. In order to extract frequent behaviours from a log file, for each g in the log file we first have to transform ip_g into a client number; then, for each record k in g, l_k^g.time is transformed into a time number and l_k^g.URL is transformed into an item identifier. Figure 1 gives a file example obtained after that pre-processing. To each client corresponds a series of times and the URL requested by the client at each time. For instance, client 2 requested the URL "f" at time d4. The goal is thus, according to Definition 2 and by means of a data mining step, to find the sequential patterns in the file that can be considered frequent. The result may be, for instance, < (a) (c) (b) (c) > (with the file illustrated in Figure 1 and a minimum support given by the user of 100%). Such a result, once mapped back into URLs, strengthens the discovery of a frequent behaviour common to n users (with n the threshold given for the data mining process) and also gives the sequence of events composing that behaviour.
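As a minimal sketch of this pre-processing step (in Python, with illustrative names; parsing a real Common Log Format file is more involved), IP addresses, timestamps and URLs can be mapped to integer identifiers to obtain the "Client-Time-Item" structure:

def preprocess(entries):
    """entries: iterable of (ip, timestamp, url) tuples, already sorted by client and time."""
    client_ids, time_ids, item_ids = {}, {}, {}
    records = []
    for ip, timestamp, url in entries:
        c = client_ids.setdefault(ip, len(client_ids) + 1)
        t = time_ids.setdefault(timestamp, len(time_ids) + 1)
        i = item_ids.setdefault(url, len(item_ids) + 1)
        records.append((c, t, i))          # one "Client-Time-Item" record per request
    return records, client_ids, time_ids, item_ids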

3 Related Work

Several methods for extracting sequential patterns have been applied to Web access log files [13, 20, 2, 8, 23, 17, 14]. We report in this section some studies using this temporal aspect for analyzing Web users' behaviour.

The WUM tool [20] allows the discovery of navigation patterns which are considered "interesting" from a statistical point of view. WUM proposes to extract patterns depending on their threshold and a user request. In [13], the authors propose WebTool. This system takes into account all the steps of a Web Usage Mining process, from data selection to result display, via data transformation and pattern extraction. WebTool is based on a prefix tree (PSP [11]) to extract sequential patterns. In [8] the authors propose to consider the temporal aspect of Web accesses in a user clustering method. This clustering algorithm is based on a sequence alignment method in order to evaluate the distance between sequences. The main contribution is to evaluate the quality of the proposed clusters. They are compared, during the experiments, to the clusters obtained with a distance based on itemsets. The authors of [23] consider navigation patterns as Markov chains. They propose to build a Markov model for a link prediction method taking into account the previous navigations. The paper is devoted to the problems related to Markov models and the transition matrix built for each log.

Recent work on analyzing Web usage has focused on the quality of the results, their relevance and their utility. This is also the case for work related to the temporal aspect of navigation patterns. In [17] the authors show that the characteristics of the Web site have to be considered before deciding to use frequent itemsets or frequent sequences (as well as sequential patterns). Mainly, three characteristics are proposed: topology, connectivity degree and length of potential navigations. They show that sequential patterns are suited to Web sites having long potential navigations (including Web sites involving dynamic pages). According to the authors of [14], the study of result quality has to consider sequential patterns with very low support. High or average thresholds often lead to useless (obvious) patterns. Nevertheless, extracting sequential patterns with very low support is very difficult because of the number of candidates generated. The authors thus propose to split the problem recursively, in order to consider each sub-problem as a specific data mining step.

Whatever the goals pursued by these Web Usage Mining approaches, they always depend on the division of the data. In the next section, we present the goal of our proposal and the general principle of our heuristic.

4 Motivation and Principle

This section is devoted to motivating our proposal with regard to the relevance and utility of the targeted knowledge. It also illustrates the issues involved and the general principle of our method.

4.1 Motivation

The outline of our method is the following: enumerating the sets of periods in the log that will be analyzed and then identifying which ones contain frequent sequential patterns. In this section we define the notions of period and of frequent sequential patterns over a period. Let us consider the set of transactions in Figure 2 (upper left table). Those transactions are sorted by timestamp, as they would be in a log file. In this table containing 9 records, the customer


c1, for instance, connected at time 1 and requested the URL a. Let us now consider the "in" and "out" timestamps of each client, reporting their arrival and departure (upper right table in Figure 2). The first request of client c1 occurred at time 1, and the last one at time 4. We can thus report the periods of that log. In the example of Figure 2 there are 5 periods. During the first period (from time 1 to time 2), client c1 was the only one connected to the Web site. Then, clients c1 and c2 are connected during the same period p2 (from time 3 to time 4), and so on.

Cust Id   time   URL
c1        1      a
c1        2      b
c2        3      a
c1        4      d
c2        5      d
c3        6      d
c2        7      e
c3        8      e
c3        9      f

Cust Id   In   Out
c1        1    4
c2        3    7
c3        6    9

Period   Begin/End   Customers
p1       [1..2]      c1
p2       [3..4]      c1, c2
p3       [5]         c2
p4       [6..7]      c2, c3
p5       [8..9]      c3

Figure 2. A log containing three sequences and the associated periods

Let us now consider the navigation sequences of the log represented in Figure 2. Those sequences are reported in Figure 3, as well as the frequent sequential patterns extracted on the whole log and on the identified periods. With a minimum support of 100% on the whole log, the only frequent sequential pattern is merely reduced to the item d: < (d) >. Let us now consider the periods identified above, as well as the customers connected during each period. For the periods p1, p3 and p5, reduced to a single client, there is no relevant frequent pattern. For the period p2, a sequential pattern is extracted: < (a) (d) >. This pattern is common to both clients connected during period p2: c1 and c2. Finally, during period p4, the pattern < (d) (e) > is extracted.

The following part of this section is devoted to more formal definitions of periods, connected clients and stable periods. Let C be the set of clients in the log and D the set of recorded timestamps.

Definition 4 P, the set of potential periods on the log, is defined as follows:

P = {(pa, pb) | (pa, pb) ∈ D × D and pa ≤ pb}.

In the following definition, we consider that dmin(c) and dmax(c) are respectively the arrival and departure times of c in the log (first and last request recorded for c).

Definition 5 Let C(a,b) be the set of clients connected during the period (a, b). C(a,b) is defined as follows:

C(a,b) = {c | c ∈ C and [dmin(c)..dmax(c)] ∩ [a..b] ≠ ∅}.

Finally, we give the definitions of stable period and dense period. The former is a maximal period pm during which Cpm does not vary. With the example given in Figure 2, the period [6..7] is a stable period. This is not the case for [3..3], which is included in [3..4] and contains the same clients (i.e. C(3,3) = C(3,4)). A dense period is a stable period containing at least one frequent sequential pattern. In the example given in Section 1, the period corresponding to January 31 (i.e. the working session) should be a dense period.

Definition 6 Let Pstable be the set of stable periods. Pstable is defined as follows:

Pstable = {(ma, mb) | (ma, mb) ∈ P and
1) there is no (m'a, m'b) ∈ P such that (mb − ma) < (m'b − m'a), [m'a..m'b] ∩ [ma..mb] ≠ ∅ and C(m'a,m'b) = C(ma,mb);
2) ∀ x, y, z, t ∈ [ma..mb] with x ≤ y and z ≤ t, C(x,y) = C(z,t)}.

Condition 1 in Definition 6 ensures that no larger period overlaps (ma, mb) while containing the same clients. Condition 2 ensures that there is no arrival or departure inside any period of Pstable.

Definition 7 A stable period p is dense if Cp contains at least one frequent sequential pattern with respect to the minimum support specified by the user, taken proportionally to |Cp|.
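As an illustration, the following minimal sketch (in Python, with illustrative names not taken from the paper) recovers the stable periods of Figure 2 from the [In, Out] interval of each client, by grouping consecutive timestamps that share the same set of connected clients:

def stable_periods(intervals, timestamps):
    """intervals: dict client -> (d_min, d_max); timestamps: sorted distinct times of the log."""
    def connected(t):
        return frozenset(c for c, (lo, hi) in intervals.items() if lo <= t <= hi)

    periods = []
    start = timestamps[0]
    current = connected(start)
    for prev, t in zip(timestamps, timestamps[1:]):
        clients = connected(t)
        if clients != current:                 # an arrival or a departure closes the period
            periods.append((start, prev, current))
            start, current = t, clients
    periods.append((start, timestamps[-1], current))
    return periods

# The example of Figure 2: periods [1..2]{c1}, [3..4]{c1,c2}, [5]{c2}, [6..7]{c2,c3}, [8..9]{c3}
intervals = {'c1': (1, 4), 'c2': (3, 7), 'c3': (6, 9)}
print(stable_periods(intervals, list(range(1, 10))))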

The notion of dense period (Definition 7) is the core of this paper. In the following, our goal will be to extract those periods, as well as the corresponding frequent patterns, from the log file. In order to give an illustration, let us consider a period pe containing 100 clients (|Cpe| = 100) and a minimum support of 5%. Any sequential pattern included in at least 5 navigations of Cpe will be considered as frequent for that period. If there exists at least one frequent pattern in pe, then this period has to be extracted by our


Cust Id   Sequence           |   log       p1, p3, p5   p2            p4
c1        < (a) (b) (d) >    |
c2        < (a) (d) (e) >    |   < (d) >   –            < (a) (d) >   < (d) (e) >
c3        < (d) (e) (f) >    |

Figure 3. Frequent sequential patterns obtained for customers connected at each period

method. Extracting the sequential patterns of each period by means of a traditional sequential pattern mining method is not a suitable solution, for the following reasons. First, sequential pattern mining algorithms (such as PSP [11] or PrefixSpan [19], for instance) can fail if one of the patterns to extract is very long. When considering navigations on a Web site, it is usual to find numerous requests for the same URL (pdf or php files, for instance). Second, during our experiments, with a total amount of 14 months of log files, we detected approximately 3,500,000 stable periods. We believe that mining dense periods by means of a heuristic is more relevant than several million calls to a traditional sequential pattern mining algorithm. The outline of our approach, intended to detect dense periods in the log file, is presented in the next section.

4.2 General Principle

Figure 4. Overview of the operations performed by PERIO

Figure 4 gives an overview of the PERIO heuristic that we propose for solving the problem of dense period mining. First, starting from the log, the periods are detected. Those periods are then considered one by one, sorted by their "begin" timestamp. For each iteration n, the period pn is scanned. The set of clients Cpn is loaded into main memory ("DB" in Figure 4). Candidates having length 2 are generated from the frequent items detected in Cpn (step "1" in Figure 4). Because of the large number of candidates generated, this operation only occurs every s steps (where s is a user-defined parameter). Candidates are then compared to the sequences of Cpn in order to detect the frequent patterns (step "2" in Figure 4). Frequent patterns are injected into the neighborhood operators described in Section 5.1.1, and the newly generated candidates are compared with the sequences of Cpn. In order to obtain a result as fine as possible on each period, the user can specify a minimum number of iterations (j) on each period.

4.3 Limits of Sequential Pattern Mining

Figure 5. Limits of a framework involving PSP

Our method will process the log file by considering millions of periods (each period corresponds to a sub-log). The principle of our method will be to extract frequent sequential patterns from each period. Let us consider that the frequent sequences are extracted with a traditional exhaustive method (designed for a static transaction database). We argue that such a method will have at least one drawback leading to a blocking operator. Let us consider the example of the PSP [12] algorithm. We have tested this algorithm on databases containing only two sequences (s1 and s2). Both sequences are equal and contain repetitions of itemsets having length one. The first database contains 11 repetitions of the itemsets (1)(2) (i.e. s1 = < (1)(2)(1)(2)...(1)(2) >, length(s1) = 22, and s2 = s1). The number of candidates generated at each scan is reported in Figure 5. Figure 5 also reports the number of candidates for databases of sequences having length 24, 26 and 28. For the database of sequences having length 28, the memory was exceeded and the process


could not succeed. We made the same observation for PrefixSpan2 [19], where the number of intermediate sequences was similar to that of PSP on the same simple databases. While this phenomenon is not blocking for methods extracting the whole exact result (one can select the appropriate method depending on the dataset), the integration of such a method in our process for extracting dense periods is impossible, because the worst case can appear in any batch3.

5 Extracting Dense Periods

In this section, we describe the steps allowing us to obtain the dense periods of a Web access log. We also describe the neighborhood operators designed for PERIO, the heuristic presented in this paper.

5.1 Heuristic

Since our proposal is a heuristic-based miner, our goal is to provide a result having the following characteristics. For each period p in the history of the log, let realResult be the set of frequent behavioural patterns embedded in the navigation sequences of the users belonging to p. realResult is the result to obtain (i.e. the result that would be exhibited by a sequential pattern mining algorithm which would explore the whole set of solutions by working on the clients of Cp). Let us now consider perioResult, the result obtained by running the method presented in this paper. We want to minimize the number of sequences of perioResult that do not belong to realResult, i.e. minimize |{Si ∈ perioResult | Si ∉ realResult}|, and to maximize the number of sequences of realResult that are recovered, i.e. maximize |{Ri ∈ realResult | Ri ∈ perioResult}|. In other words, we want to find most of the sequences occurring in realResult while preventing the proposed result from becoming larger than it should (otherwise the set of all client navigations would be considered a good solution, which is obviously wrong).

This heuristic is inspired by genetic algorithms and their neighborhood operators. Those operators are provided with properties of frequent sequential patterns in order to produce optimal candidates. The main idea of the PERIO algorithm is to scan Pstable, the set of stable periods, and, for each p in Pstable, to propose a population of candidates thanks to previously found frequent patterns and neighborhood operators. These candidates are then compared to the sequences of Cp in order to know their support (or at least their distance from a frequent sequence). These two phases (neighborhood operators and candidate valuation) are explained in this section.

2 http://www-sal.cs.uiuc.edu/ hanj/software/prefixspan.htm
3 In a web usage pattern, numerous repetitions of requests for pdf or php files, for instance, are usual.

5.1.1 Neighborhood Operators

Figure 6. Some operators designed for extracting frequent navigation patterns

The neighborhood operators we used were validated by experiments performed on the Web logs of Inria Sophia Antipolis (see Section 6). We chose "genetic-like" operators as well as operators based on sequential pattern properties. We present here some of the most efficient operators for the problem presented in this paper. When we talk about a random sequence, we use a biased random choice, such that sequences having a high support are more likely to be chosen than sequences having a low support.

Finally, we evaluated the success rate of each of our operators as the average number of frequent sequences compared to the number of proposed candidates. An operator having a success rate of 20% is an operator for which 20% of the proposed candidates are detected as frequent.

New frequent items: When a new frequent item occurs (after being requested by one or more users), it is used to generate all possible 2-candidate sequences with the other frequent items. The generated candidate set is then added to the global candidate set. Due to the number of candidate sequences to test, this operator only has a 15% ratio of accepted (i.e. frequent) sequences. This operator however remains essential, since the frequent 2-sequences obtained are essential for the other operators.

Adding items: This operator chooses a random item among the frequent items and adds this item to a random sequence s, at every possible position in s. This operator generates length(s)+1 candidate sequences. For instance, with the sequence < (a) (b) (d) > and the frequent item c, we will generate the candidate sequences < (c) (a) (b) (d) >, < (a) (c) (b) (d) >, < (a) (b) (c) (d) > and finally < (a) (b) (d) (c) >. This operator has a 20% ratio of accepted sequences, but the sequences found are necessary for the following operators.

Basic crossover: This operator (largely inspired by genetic algorithm operators) uses two different random sequences and proposes two new candidates coming from their amalgamation. For instance, with the sequences < (a) (b) (d) (e) > and < (a) (c) (e) (f) >, we propose the candidates < (a) (b) (e) (f) > and < (a) (c) (d) (e) >. This operator has a good ratio (50%), thanks to the frequent sequences embedded in the candidates generated by the previous operators.

Enhanced crossover: Encouraged by the results obtained when running the previous operator, we developed a new operator, designed as an enhancement of the basic crossover and based on the properties of frequent sequences. This operator chooses two random sequences, and the crossover is not performed in the middle of each sequence, but at the end of the longest prefix common to the considered sequences. Let us consider two sequences < (a) (b) (e) (f) > and < (a) (c) (d) (e) > coming from the previous crossover operator. The longest prefix common to these two sequences is < (a) >. The crossover therefore starts after the item following a, for each sequence. In our example, the two resulting candidate sequences are < (a) (b) (c) (d) (e) > and < (a) (c) (b) (e) (f) >. This operator has a success ratio of 35%.

Final crossover: An ultimate crossover operator was designed in order to improve the previous ones. This operator is based on the same principle as the enhanced crossover operator, but the second sequence is not randomly chosen. Indeed, the second sequence is chosen as being the one having the longest common prefix with the first one. This operator has a ratio of 30%.

Sequence extension: This operator is based on the following idea: frequent sequences are extended with newly requested pages. The basic idea is to add new frequent items at the end of several random frequent sequences. This operator has a success ratio of 60%.

Figure 6 gives an illustration of some of the operators described in this section.
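Below is a minimal sketch of two of these operators ("adding items" and "enhanced crossover") in Python, with sequences represented as lists of itemsets and illustrative names not taken from the paper:

def add_item(sequence, item):
    """'Adding items': insert (item) at every position, giving len(sequence)+1 candidates."""
    return [sequence[:i] + [(item,)] + sequence[i:] for i in range(len(sequence) + 1)]

def enhanced_crossover(s1, s2):
    """'Enhanced crossover': cross the two parents just after their longest common prefix."""
    p = 0
    while p < min(len(s1), len(s2)) and s1[p] == s2[p]:
        p += 1
    prefix = s1[:p]
    c1 = prefix + s1[p:p + 1] + s2[p:]   # < (a) (b) (c) (d) (e) > in the example above
    c2 = prefix + s2[p:p + 1] + s1[p:]   # < (a) (c) (b) (e) (f) >
    return c1, c2

print(add_item([('a',), ('b',), ('d',)], 'c'))                 # the four candidates listed above
print(enhanced_crossover([('a',), ('b',), ('e',), ('f',)],
                         [('a',), ('c',), ('d',), ('e',)]))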

5.1.2 Candidate Evaluation

The PERIO heuristic is described by the following algorithm:

Algorithm PERIO
In:  Pstable, the set of stable periods.
Out: SP, the sequential patterns corresponding to the most frequent behaviours.

For (p ∈ Pstable)
    // Update the items supports
    itemsSupports = getItemsSupports(Cp);
    // Generate candidates from frequent items and patterns
    candidates = neighborhood(SP, itemsSupports);
    For (c ∈ candidates)
        For (s ∈ Cp)
            CandidateValuation(c, s);
    For (c ∈ candidates)
        If (support(c) > minSupport OR criteria)
            insert(c, SP);
End algorithm PERIO

Algorithm CANDIDATEEVALUATION
In:  c, a candidate to evaluate, and s, the navigation sequence of the client.
Out: p[c], the percentage given to c.

// If c is included in s, c is rewarded
If (c ⊆ s) p[c] = 100 + length(c);
// If c, having length at most 2, is not included, then give c the lowest mark
Else If (length(c) ≤ 2) p[c] = 0;
// Else, give c a mark and give large distances a penalty
Else p[c] = (length(LCS(c, s)) × 100) / length(c) − length(c);
End algorithm CANDIDATEEVALUATION
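A minimal sketch of this evaluation in Python, assuming the classical longest common subsequence on itemset sequences and reusing the inclusion test of Definition 1 (names are illustrative, not taken from the paper):

def lcs_length(a, b):
    """Length of the longest common subsequence of two itemset sequences."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def is_included(c, s):
    """c is a subsequence of s, with itemset inclusion (Definition 1)."""
    i = 0
    for itemset in s:
        if i < len(c) and set(c[i]) <= set(itemset):
            i += 1
    return i == len(c)

def evaluate(c, s):
    """Mark given to candidate c against the navigation sequence s."""
    if is_included(c, s):                    # reward included candidates; longer is better
        return 100 + len(c)
    if len(c) <= 2:                          # short and not included: lowest mark
        return 0
    return lcs_length(c, s) * 100 / len(c) - len(c)   # penalize distance and length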

For each stable period of Pstable, PERIO generates new candidates and then compares each candidate to the sequences of Cp. The comparison returns a percentage representing the distance between the candidate and the navigation sequence. If the candidate is included in the sequence, the percentage should be 100%, and this percentage decreases when the amount of interference (differences between the candidate and the navigation sequence) increases. To evaluate this distance, the percentage is obtained as the length of the longest common subsequence (LCS) [5] between s and c divided by the length of s: |LCS(s, c)|/|s|. Furthermore, in order to obtain frequent sequences that are as long as possible, the algorithm rewards long sequences when they are included in the navigation sequence. On the other hand, the algorithm has to penalize long sequences that are not included (so that an arbitrarily long sequence does not receive a good mark). To cover all these parameters, the calculation performed for each client sequence is described in the algorithm CANDIDATEEVALUATION. Finally, evaluated candidates having either a support greater than or equal to the minimum support value, or satisfying a "natural selection criterion", are stored in SP. This last criterion, which is user-defined, is a threshold corresponding to the distance between the candidate support and the minimum support. In our case, this criterion is used in order to prevent the PERIO heuristic from being trapped in a local optimum.

5.2 Result Summary and Visualization

Figure 7. Clustering of sequential patterns before their alignment

Due to the number of candidates proposed by such a heuristic, the number of resulting sequences is very large. For instance, if the patterns <(a)(b)> and <(a)(b)(c)> are extracted by PERIO, then both will be inserted in the result. In fact this problem cannot be reduced to the inclusion problem. As the extracted patterns can be very long and as the time spent processing each period has to be as short as possible, we may obtain patterns that are very close to each other. Furthermore, extracted patterns can also be very different, since they represent different kinds of behaviours. In order to facilitate the visualization of the resulting patterns, we propose to extend the work of [10].

Our method performs as follows. We cluster similar sequences together. This operation is based on a hierarchical clustering algorithm [7] where the similarity is defined as follows:

Definition 8 Let s1 and s2 be two sequences. Let |LCS(s1, s2)| be the size of the longest common subsequence between s1 and s2. The degree of similarity between s1 and s2 is defined as: d = 2 × |LCS(s1, s2)| / (|s1| + |s2|).
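A minimal sketch of this similarity degree in Python, reusing a classical LCS computation on itemset sequences (names are illustrative, not taken from the paper):

def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def similarity(s1, s2):
    return 2 * lcs_length(s1, s2) / (len(s1) + len(s2))

print(similarity([('a',), ('b',)], [('b',), ('c',)]))   # 0.5: <(a)(b)> and <(b)(c)> share (b)
print(similarity([('a',), ('b',)], [('d',), ('e',)]))   # 0.0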

Step 1:
S1   : <(a,c) (e) () (m,n)>
S2   : <(a,d) (e) (h) (m,n)>
SA12 : (a:2, c:1, d:1):2 (e:2):2 (h:1):1 (m:2, n:2):2

Step 2:
SA12 : (a:2, c:1, d:1):2 (e:2):2 (h:1):1 (m:2, n:2):2
S3   : <(a,b) (e) (i,j) (m)>
SA13 : (a:3, b:1, c:1, d:1):3 (e:3):3 (h:1, i:1, j:1):2 (m:3, n:2):3

Step 3:
SA13 : (a:3, b:1, c:1, d:1):3 (e:3):3 (h:1, i:1, j:1):2 (m:3, n:2):3
S4   : <(b) (e) (h,i) (m)>
SA14 : (a:3, b:2, c:1, d:1):4 (e:4):4 (h:2, i:2, j:1):3 (m:4, n:2):4

Figure 8. Different steps of the alignment method with the sequences from Example 2

The clustering algorithm performs as follows. Each sequential pattern is first considered as a cluster (cf. Step 0, Figure 7). At each step, the matrix of similarities between clusters is computed. For instance, the sequences <(a)(b)> and <(b)(c)> are 50% similar since they share the itemset (b). If we now consider the two sequences <(a)(b)> and <(d)(e)>, their similarity is 0%. The two closest clusters are either {<(a)(b)>, <(b)(c)>} or {<(d)(e)>, <(d)(f)>}, since they are at the same distance. They are grouped together into a unique cluster. Step "2" of Figure 7 shows the three clusters {<(a)(b)>, <(b)(c)>}, {<(d)(e)>} and {<(d)(f)>}. This process is repeated until there is no cluster left having a similarity greater than 0 with an existing cluster. The last step of Figure 7 gives the result of the clustering phase: {<(a)(b)>, <(b)(c)>} and {<(d)(e)>, <(d)(f)>}.

The clustering algorithm ends with clusters of similar sequences, which is a key element for sequence alignment. The alignment of sequences leads to a weighted sequence represented as follows: SA = < I1 : n1, I2 : n2, ..., Ir : nr > : m. In this representation, m stands for the total number of sequences involved in the alignment. Ip (1 ≤ p ≤ r) is an itemset represented as (x_i1 : m_i1, ..., x_it : m_it), where m_it is the number of sequences containing the item x_it at the pth position in the aligned sequences. Finally, np is the number of occurrences of the itemset Ip in the alignment. Example 2 describes the alignment process on 4 sequences. Starting from two sequences, the alignment begins with the insertion of empty items (at the beginning, at the end or inside the sequence) until both sequences contain the same number of itemsets.

Example 2 Let us consider the following sequences: S1 = < (a,c) (e) (m,n) >, S2 = < (a,d) (e) (h) (m,n) >, S3 = < (a,b) (e) (i,j) (m) >, S4 = < (b) (e) (h,i) (m) >. The steps leading to the alignment of those sequences are detailed in Figure 8. First, an empty itemset is inserted into S1. Then S1 and S2 are aligned in order to provide SA12. The alignment process is then applied to SA12 and S3. The alignment method goes on, processing two sequences at each step.


At the end of the alignment process, the aligned sequence (SA14 in Figure 8) is a summary of the corresponding cluster. The approximate sequential pattern can be obtained by specifying k, the number of occurrences an item must have in order to be displayed. For instance, with the sequence SA14 from Figure 8 and k = 2, the filtered aligned sequence will be <(a,b)(e)(h,i)(m,n)> (corresponding to the items having a number of occurrences greater than or equal to k).
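A minimal sketch of this filtering step in Python, representing a weighted sequence as a list of (item-to-count mapping, np) pairs; names are illustrative, not taken from the paper:

def filter_aligned(weighted_sequence, k):
    """Keep, in each aligned itemset, only the items occurring at least k times."""
    result = []
    for itemset, _np in weighted_sequence:
        kept = tuple(sorted(item for item, count in itemset.items() if count >= k))
        if kept:
            result.append(kept)
    return result

# SA14 from Figure 8 with k = 2 gives <(a,b)(e)(h,i)(m,n)>
sa14 = [({'a': 3, 'b': 2, 'c': 1, 'd': 1}, 4), ({'e': 4}, 4),
        ({'h': 2, 'i': 2, 'j': 1}, 3), ({'m': 4, 'n': 2}, 4)]
print(filter_aligned(sa14, 2))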

6 Experiments

PERIO was written in C++ and compiled using gcc without any optimization flags. All the experiments were performed on a PC with a 2.1 GHz Pentium processor running Linux (RedHat). They were applied to the Inria Sophia Antipolis logs. These logs are obtained daily. At the end of a month, all daily logs are merged into a monthly log. For the experiments we worked on 14 monthly logs. They were merged in order to obtain a unique log covering a 14-month period (from January 2004 to March 2005). Its size is 14 GB of records. There are 3.5 million sequences (users); the average length of these sequences is 2.68 and the maximal size is 174 requests.

6.1 Extracted Behaviours

We report here some of the extracted behaviours. Those behaviours show that an analysis based on multiple divisions of the log (as described in this paper) allows behavioural patterns embedded in short or long periods to be obtained. The execution time of PERIO on this log with a minimum support value of 2% is nearly 6 hours. The support of 2% was the best setting for obtaining interesting patterns while limiting the size of the output. We found 1981 frequent behaviours, which were grouped into 400 clusters with the techniques described in Section 5.2.

Figure 10 focuses on the evolution of the following behaviours:

• C1 =<(semir/restaurant)(semir/restaurant/consult.php)(semir/restaurant/index.php)(semir/restaurant/index.php)>

• C2 =<(eg06) (eg06/dureve 040702.pdf)(eg06/fer 040701.pdf) (eg06)>

• C3 =<(requete.php3) (requete.php3)(requete.php3)>

• C4 =<(Hello.java) (HelloClient.java)(HelloServer.java)>

• C5 =<(mimosa/fp/Skribe)(mimosa/fp/Skribe/skribehp.css)(mimosa/fp/Skribe/index-5.html)>

• C6 =<(sgp2004) (navbar.css)(submission.html)>

All itemsets of behaviour C4 are prefixed by "oasis/anonym2/Prog Rpt/TD03-04/hello/". For C3 the prefix is "mascotte/anonym3/web/td1/" and for C6 the prefix is "geometrica/events/".

The first behaviour (C1) corresponds to a typically periodic behaviour. Actually, the Inria restaurant was closed for a few weeks and people had to order a cold meal through a dedicated web site. This web site was located at "semir/restaurant". C2 is representative of behaviours related to the recent "general assembly" of French researchers, hosted in Grenoble (France, October 2004).

Behaviours C3 and C4 correspond to navigations performed by students on pages about computer science courses, stored on some Inria researchers' pages.

When we noticed the C5 behaviour, we asked the owner of the pages for an explanation. His interpretation is that such behaviours are due to the large number of mails exchanged in March 2004 through the mailing list of Skribe (generating numerous navigations on the web pages of this project). For behaviour C6, two different peaks appear (at the beginning of April and in the middle of April). Those peaks correspond to the submission steps (respectively abstracts and full papers) of articles for the SGP 2004 conference.

Some of the extracted behaviours do not occur on short periods only. Their occurrences are frequent over several weeks or even several months. Their support on the global log is related to the number of customers connected in each period. This is the case, for instance, of:

• C7 =<(css/inria sophia.css)(commun/votre profil en.shtml)(presentation/ chiffres en.shtml)(actu/actu scient colloque encours fr.shtml)>

The evolution of C7 is reported in Figure 9. We can observe that this behaviour occurs for 5 consecutive months (from May to September).

6.2 Comparison to Sequential Pattern Mining

Section 6.1 was devoted to showing some extracted behaviours and their content. In this section, we compare our method with a traditional sequential pattern mining method. We will show that the behaviours obtained by PERIO have such a low support that:


Figure 9. Peaks of frequency for a behaviour on a long period

Figure 10. Peaks of frequency for C1, C2, C3, C4, C5 and C6

1. They cannot be extracted by a traditional sequential pattern mining algorithm.

2. The period they belong to cannot be identified by a traditional sequential pattern mining algorithm.

In Figure 11 we report several pieces of information about the behaviours presented in Section 6.1. The meaning of each value is given in Figure 12. We give this information at three granularities (year, month and day). First of all, we give the maximum number of simultaneous occurrences of each behaviour in a stable period (column "Max"). Then we report the global support of each behaviour: the number of sequences containing the behaviour in the whole log file is given in column "Global", whereas the ratio is given in column "%Global".

A first comparison is given with PSP on the whole log file for each behaviour. We report in PSPGlobal the execution time of PSP on the whole log file with a support of %Global. We can observe that for each behaviour, PSP is unable to extract the patterns corresponding to the given support. The main reason is that this support is much lower than any traditional method for mining sequential patterns would accept. The number of frequent items for C6 with a support of 0.0364% (bold "–") is 935. In this case, the number of candidates having length 2 is 1,311,805, so the main memory was rapidly overloaded and PSP could not succeed.

We also identified (by comparing the months with each other) for each behaviour the month having the highest number of simultaneous occurrences of this behaviour in


      Max   Global   %Global   PSPGlobal   Month     %Month   PSPMonth   Day      %Day     PSPDay
C1    13    507      0.0197%   –           08-2004   0.031%   –          Aug-09   0.095%   20s
C2    8     69       0.0027%   –           07-2004   0.004%   –          Jun-10   0.2%     –
C3    10    59       0.0023%   –           07-2004   0.004%   –          Jul-02   0.33%    10s
C4    12    19       0.0007%   –           02-2004   0.006%   –          Feb-06   0.35%    18s
C5    10    32       0.0012%   –           02-2004   0.01%    –          Feb-16   0.33%    21s
C6    10    935      0.0364%   –           02-2004   0.09%    –          Mar-15   0.35%    12s
C7    10    226      0.0088%   –           04-2004   0.01%    –          Apr-03   0.23%    8s

Figure 11. Supports of the extracted behaviours at 3 granularities (Global, Month & Day)

Max         The maximum number of simultaneous occurrences of this behaviour in a stable period
Global      The support (total number of occurrences) of this behaviour in the global (14 months) log file
%Global     The support (percentage) corresponding to Global w.r.t. the number of data sequences in the global log file
PSPGlobal   The execution time of PSP on the global log file with a minimum support of %Global
Month       The month having the highest number of simultaneous occurrences of this behaviour in a stable period
%Month      The support (percentage) of this behaviour on Month
PSPMonth    The execution time of PSP on the log file corresponding to Month with a minimum support of %Month
Day         The day having the highest number of simultaneous occurrences of this behaviour in a stable period
%Day        The support (percentage) of this behaviour on Day
PSPDay      The execution time of PSP on the log file corresponding to Day with a minimum support of %Day

Figure 12. Legend for the table of Figure 11

a stable period. In fact, the column "Month" corresponds to the month where this behaviour has the best support compared to the other months. We report in column %Month the support of each behaviour in the corresponding month, and in column PSPMonth the execution time of PSP on the corresponding month with a support of %Month. We can observe that PSP is unable to extract the sequential patterns corresponding to each month.

Finally, we identified for each behaviour the day having the highest number of simultaneous occurrences of this behaviour in a stable period (column "Day"). We report in column %Day the support of each behaviour on the corresponding day, and in column PSPDay the execution time of PSP on the corresponding day with a support of %Day. We can observe that, at this granularity, PSP is able to extract most of the behaviours. Furthermore, PSP is even so fast that it could be applied to each day of the log, and the total time would be around 70 minutes (420 days and an average execution time of approximately 10 seconds per day). Nevertheless, we have to keep in mind that with such an approach:

1. Undiscovered periods will remain (for instance a period of two consecutive days, or a period of one hour embedded in one of the considered days).

2. Undiscovered behaviours will remain (embedded in the undiscovered periods).

3. The method would be based on an arbitrary division of the data (why work on each day and not on each hour, each week or each half day?).

Finally, in order to avoid the drawbacks enumerated above, the only solution would be to work on each stable period and apply a traditional sequential pattern mining algorithm. However, this would require several million calls to the mining algorithm, and the total execution time would be around 20 days (3,500,000 periods and an average execution time of approximately 0.5 seconds per period). Furthermore (as stated in Section 4.3), this solution is not satisfactory because of the long repetitive sequences that may be embedded in the data.

7 Conclusion

The proposition developed in this paper has shown that considering a log at large, i.e. without any division according to different granularities as traditional approaches do, can provide the end user with a new kind of knowledge: the periods where behaviours are particularly significant and distinct. In fact, our approach aims at rebuilding all the different periods the log is made up of. Nevertheless, by considering the log at large (several months, several years, ...) we have to deal with a large


number of problems: too many periods, too low a frequency of behaviours, inability of traditional algorithms to mine sequences on one of these periods, etc. We have shown that a heuristic-based approach is very useful in that context and that, by indexing the log period by period, we can extract frequent behaviours if they exist. Those behaviours may be very limited in time, or frequently repeated, but their main particularity is that they occur rarely over the whole log while being representative of a dense period. The experiments conducted have shown different kinds of behaviours concerning, for instance, students, conferences, or a restaurant. These behaviours were completely hidden in the log files and cannot be extracted by traditional approaches, since they are frequent on particular periods rather than frequent on the whole log.

References

[1] R. Agrawal and R. Srikant. Mining Sequential Patterns. In Proceedings of the 11th Int. Conf. on Data Engineering (ICDE'95), Taipei, Taiwan, March 1995.

[2] F. Bonchi, F. Giannotti, C. Gozzi, G. Manco, M. Nanni, D. Pedreschi, C. Renso, and S. Ruggieri. Web log data warehousing and mining for intelligent web caching. Data & Knowledge Engineering, 39(2):165-189, 2001.

[3] W. W. W. Consortium. httpd-log files. In http://lists.w3.org/Archives, 1998.

[4] R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1):5-32, 1999.

[5] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. MIT Press.

[6] U. Fayad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors. Advances in Knowledge Discovery and Data Mining. AAAI Press, Menlo Park, CA, 1996.

[7] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.

[8] B. Hay, G. Wets, and K. Vanhoof. Mining Navigation Patterns Using a Sequence Alignment Method. Knowledge and Information Systems, 6(2):150-163, 2004.

[9] http-analyze. http://www.http-analyze.org/.

[10] H. Kum, J. Pei, W. Wang, and D. Duncan. ApproxMAP: Approximate mining of consensus sequential patterns. In Proceedings of the SIAM Int. Conf. on Data Mining, San Francisco, CA, 2003.

[11] F. Masseglia, F. Cathala, and P. Poncelet. The PSP Approach for Mining Sequential Patterns. In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery (PKDD'98), pages 176-184, Nantes, France, September 1998.

[12] F. Masseglia, F. Cathala, and P. Poncelet. The PSP Approach for Mining Sequential Patterns. In Proceedings of the 2nd European Symposium on Principles of Data Mining and Knowledge Discovery, Nantes, France, September 1998.

[13] F. Masseglia, P. Poncelet, and R. Cicchetti. An efficient algorithm for web usage mining. Networking and Information Systems Journal (NIS), April 2000.

[14] F. Masseglia, D. Tanasa, and B. Trousse. Web usage mining: Sequential pattern extraction with a very low support. In Advanced Web Technologies and Applications: 6th Asia-Pacific Web Conference (APWeb 2004), Hangzhou, China, 14-17 April 2004.

[15] N. Meger and C. Rigotti. Constraint-Based Mining of Episode Rules and Optimal Window Sizes. In Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), pages 313-324, Pisa, Italy, September 2004.

[16] B. Mobasher, H. Dai, T. Luo, and M. Nakagawa. Discovery and evaluation of aggregate usage profiles for web personalization. Data Mining and Knowledge Discovery, 6(1):61-82, January 2002.

[17] M. Nakagawa and B. Mobasher. Impact of Site Characteristics on Recommendation Models Based on Association Rules and Sequential Patterns. In Proceedings of the IJCAI'03 Workshop on Intelligent Techniques for Web Personalization, Acapulco, Mexico, August 2003.

[18] C. Neuss and J. Vromas. Applications CGI en Perl pour les Webmasters. Thomson Publishing, 1996.

[19] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In Proceedings of the 17th International Conference on Data Engineering (ICDE), 2001.

[20] M. Spiliopoulou, L. C. Faulstich, and K. Winkler. A data miner analyzing the navigational behaviour of web users. In Proceedings of the Workshop on Machine Learning in User Modelling of the ACAI'99 Int. Conf., Crete, Greece, July 1999.

[21] R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th Int. Conf. on Extending Database Technology (EDBT'96), pages 3-17, Avignon, France, September 1996.

[22] Webalizer. http://www.mrunix.net/webalizer/.

[23] J. Zhu, J. Hong, and J. G. Hughes. Using Markov Chains for Link Prediction in Adaptive Web Sites. In Proceedings of Soft-Ware 2002: First Int. Conf. on Computing in an Imperfect World, pages 60-73, Belfast, UK, April 2002.


A Dissimilarity Measure for Comparing Subsets of Data:

Application to Multivariate Time Series∗

Matthew Eric Otey Srinivasan Parthasarathy

Department of Computer Science and Engineering

The Ohio State University

Contact: [email protected]

Abstract

Similarity is a central concept in data mining. Many techniques, such as clustering and classification, use similarity or distance measures to compare various subsets of multivariate data. However, most of these measures are only designed to find the distances between a pair of records or attributes in a data set, and not for comparing whole data sets against one another. In this paper we present a novel dissimilarity measure based on principal component analysis for making such comparisons between data sets, and in particular time series data sets. Our measure accounts for the correlation structure of the data, and can be tuned by the user to account for domain knowledge. Our measure is useful in such applications as change point detection, anomaly detection, and clustering, in fields such as intrusion detection, clinical trial data analysis, and stock analysis.

1 Introduction

Similarity is a central concept in data mining. Research in this area has primarily progressed along two fronts: object similarity [3, 17, 12] and attribute similarity [9, 24]. The former quantifies the distance between two objects (rows) in the database, while the latter refers to the distance between attributes (columns). A related problem is that of determining the similarity or dissimilarity of two subsets of data. Basic approaches have involved using classification [15], clustering [18], and mining contrast sets [6]. However, these approaches build models of the data sets, instead of quantifying their differences. In this paper we examine the notion of quantifying the dissimilarity between different subsets of data, and in particular, different multivariate time series. We propose a novel dissimilarity measure that can be used to quantify the differences between two data sets.

* This work is supported in part by NSF grants (CAREER-IIS-0347662) and (NGS-CNS-0406386), and a grant from Pfizer, Incorporated.

One motivating application for such a metric is the analysis of clinical drug trials to detect the efficacy and hepatotoxicity of drugs. Here one can view each patient in the trial as a different time series data set, for which multiple observations of various analytes are measured and stored at varying time points. A dissimilarity measure in this context can help cluster patients into groups of similar patients, or alternatively detect anomalous patients. Another application is financial stock market analysis, where different subsets of the data (for example, different sectors or time periods) can be examined for change point detection, anomaly detection and clustering. This requires the development of a suitable dissimilarity measure.

A suitable dissimilarity measure has several requirements. First, it must take into account as much of the information contained in the data sets as possible. For example, simply calculating the Euclidean distance between the centroids of two data sets is ineffective, as this approach ignores the correlations present in the data sets. Second, it must be user-tunable in order to account for domain knowledge. For example, in some domains differences in the means of two data sets may not be as important as differences in their correlation structures. In this case, differences in the means should be weighted less than differences in the correlations. Third, the dissimilarity measure should be tolerant of missing and noisy data, since in many domains data collection is imperfect, leading to many missing attribute values.

In this paper we propose a novel dissimilarity metric based on principal component analysis (PCA). Our measure consists of three components that separately take into account differences in the means, correlations, and variances of the data sets (time series) being compared. As such, our measure takes into account much of the information in the data sets. It is also possible to weight the components differently, so one can incorporate domain knowledge into the measure. Finally, our measure is robust towards noise and missing data. We demonstrate the efficacy of the proposed metric in a variety of application domains, including anomaly detection, change detection and data set clustering, on both synthetic and real data sets.

The rest of the paper is organized as follows. We first briefly review related work in Section 2. We then present our dissimilarity measure in Section 3, and discuss several applications of the measure. In Section 4, we present experimental results showing the performance of our measure when used for several applications on stock market data sets. Finally, in Section 5 we conclude with directions for future work.

2 Related Work

As mentioned above, there have been many metrics proposed that find the distance or similarity between the records of a data set [3, 17, 12], or between the attributes of a data set [9, 24]. However, these metrics are defined only between a pair of records or attributes. Similarity metrics for comparing two data sets have been used in image recognition [16] and hierarchical clustering [18]. The Hausdorff distance [16] between two sets A and B is the minimum distance r such that all points in A are within distance r of some point in B, and vice versa. Agglomerative hierarchical clustering frequently makes use of the single-link and complete-link distances between two clusters [18] to decide which pair of clusters can be merged. The single-link distance between two clusters is the minimum pairwise distance between points in cluster A and points in cluster B, while the complete-link distance is the maximum pairwise distance between points in cluster A and points in cluster B. There is also an average-link distance [14], which is the average of all pairwise distances between points in cluster A and points in cluster B. However, these metrics do not explicitly take into account the correlations between attributes in the data sets (or clusters). Parthasarathy and Ogihara [21] propose a similarity metric for clustering data sets based on frequent itemsets. By this metric, two data sets are considered similar if they share many frequent itemsets, and these itemsets have similar supports. This metric takes into account correlations between the attributes, but it is only applicable to data sets with categorical or discrete attributes.

There has also been work on defining distance metrics that take into account the correlations present in continuous data. The most popular metric is the Mahalanobis distance [22], which accounts for the covariances of the attributes of the data. However, this can only be used to calculate the distance between two points in the same data set. Yang et al. [25] propose an algorithm for subspace clustering (i.e. clustering subsets of both points and attributes in a data set) that finds clusters whose attributes are positively correlated with each other. Bohm et al. [7] modify the DBSCAN algorithm [11] by using PCA to find clusters of points that are not only density-connected, but correlation-connected as well. That is to say, they find subsets of a data set that have similar correlations. To determine whether two points of the data set should be merged into a single cluster, they must be in each other's "correlation" neighborhood, which is determined by a PCA-based approximation to the Mahalanobis distance. This approach is more flexible than Yang et al.'s in that it can find clusters with negative correlations between the attributes. However, their measure is unable to find subsets of data with similar correlations that are not density-connected. Furthermore, both Yang et al.'s and Bohm et al.'s approaches are interested only in finding clusters of points within a single data set, instead of clustering multiple data sets. Finally, Yang and Shahabi [26] use an extension of the Frobenius norm called Eros to calculate the similarity of two time series. A component of our similarity measure is very similar to Eros (see Section 3.1.2). Unlike Eros, however, our measure contains other components they do not consider. For example, they do not consider the differences in the means of the two time series. Furthermore, in our approach, the weights of the different components can be adjusted based on domain knowledge.

Recently, Aggarwal has argued for user interaction when designing distance functions between points [2]. He presents a parametrized Minkowski distance metric and a parametrized cosine similarity metric that can be tuned for different domains. He also proposes a framework for automatically tuning the metric to work appropriately in a given domain. Based on these ideas, in the next section we present a tunable metric for computing a measure of dissimilarity across data (sub)sets.


3 Algorithms

In this section we first present our dissimilarity measure and demonstrate its effectiveness with a small example data set. We then discuss in detail various applications of our dissimilarity measure that demonstrate its utility and flexibility.

3.1 Dissimilarity Measure

Our goal is to quantify the dissimilarity of two homogeneous k-dimensional data sets X and Y. This measure of dissimilarity should take into account not only the distances between the data points in X and Y, but the correlations between the attributes of the data sets as well.

In general, the dissimilarity of two data sets X and Y is denoted as D(X, Y). We define the function D in terms of three dissimilarity functions that take into account the differences in location, rotation, and variance between the data sets. Each of these components is discussed separately below. These three components are combined by means of a product, or by a weighted sum, which allows one to weight the components differently so as to incorporate domain knowledge. For example, in the domain of network intrusion detection, one may be concerned with time series data sets where column i represents the ith computer on a given subnetwork, and row j represents the number of bytes received between times t_{j-1} and t_j. When comparing subsets of this data set taken from different time points, large differences in the mean may be indicative of a denial-of-service attack. Alternatively, differences in the correlation of the number of bytes received by two different machines may be indicative of one of the machines being used by an unauthorized user. Depending on what the user wishes to detect, the measure can be tuned in different ways.

3.1.1 Distance Component

To determine the distance between two data sets, there is a wide variety of distance metrics we can use. We have implemented several different distance metrics, including the single-link and complete-link distances, among others (see Section 2). In this work we consider two distance measures for the centroids of the data sets. The Euclidean distance between the centroids of each data set is given by:

D_d(X, Y) = ||µ_X − µ_Y||_2.    (1)

The other distance measure we use is the Mahalanobis distance, given by:

D_d(X, Y) = (µ_X − µ_Y) Σ_XY^(-1) (µ_X − µ_Y)^T    (2)

where Σ_XY is the covariance matrix of the combination of data sets X and Y.
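As an illustration, a minimal NumPy sketch of Equations 1 and 2, assuming X and Y are arrays with one row per observation and one column per attribute (function names are illustrative, not from the paper):

import numpy as np

def euclidean_centroid_distance(X, Y):
    """Equation 1: Euclidean distance between the centroids of X and Y."""
    return np.linalg.norm(X.mean(axis=0) - Y.mean(axis=0))

def mahalanobis_centroid_distance(X, Y):
    """Equation 2: Mahalanobis distance between the centroids, using the covariance
    of the combined data sets (the pseudo-inverse guards against singular covariance)."""
    diff = X.mean(axis=0) - Y.mean(axis=0)
    cov_xy = np.cov(np.vstack([X, Y]), rowvar=False)
    return float(diff @ np.linalg.pinv(cov_xy) @ diff)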

3.1.2 Rotation Component

The next component measures the degree to which the data set X must be rotated so that its principal components point in the same direction as those of Y. The principal components of a data set are the set of orthogonal vectors such that the first vector points in the direction of greatest variance in the data, the second points in the orthogonal direction of the second greatest variance in the data, and so on [20, 23]. We consider X and Y to be most similar to each other when their principal components, paired according to their ranks, are aligned, and most dissimilar when all of the components of X are orthogonal to those of Y.

More formally, given a data set X, consider the singular value decomposition (SVD) of its covariance matrix:

$\mathrm{cov}(X) = U_X \Lambda_X U_X^T$   (3)

where the columns of $U_X$ are the principal components of the data set X, arranged from left to right in order of decreasing variance in their respective directions, and $\Lambda_X$ is the diagonal matrix of singular values (eigenvalues). Note that one can also take the singular value decomposition of the correlation matrix of X as an alternative to the covariance matrix. To determine the rotation dissimilarity between the two data sets X and Y, we measure the angles between their principal components.

Since the columns of $U_X$ and $U_Y$ are unit vectors, the diagonal of the matrix $U_X^T U_Y$ contains the cosines of the angles between the corresponding principal components, and so our rotation dissimilarity measure Dr is defined as the sum of the angles between the components:

$D_r(X, Y) = \mathrm{trace}\bigl(\cos^{-1}(|U_X^T U_Y|)\bigr).$   (4)

Since the signs of the principal components can be ignored, taking the absolute value ensures that we are only concerned with acute angles. It can easily be shown that if X and Y are n-dimensional data sets, then Dr(X,Y) only takes on values in the interval [0, nπ/2], where a value of 0 indicates that the principal components are exactly aligned according to the size of their corresponding eigenvalues, and a value of nπ/2 indicates that the principal components are completely orthogonal.


We note that Dr is very similar to the Eros similarity measure presented in [26]. The central difference is that we take the arc cosine of $U_X^T U_Y$ so that Dr measures dissimilarity instead of similarity, as Eros does.

Note that the rotation dissimilarity measure Dr also accounts for some aspects of the differences in the covariance structures of X and Y, since it measures the amount of rotation needed so that their respective principal components are aligned in order of decreasing variance. However, we still must account for the amount of variance in each direction, or the "shape" of the data sets.
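A small NumPy sketch of the rotation component (again our own illustration, assuming the decomposition of Equation 3 and reusing the import above):

    def rotation_dissimilarity(X, Y):
        """D_r: sum of acute angles between corresponding principal components."""
        Ux, _, _ = np.linalg.svd(np.cov(X, rowvar=False))
        Uy, _, _ = np.linalg.svd(np.cov(Y, rowvar=False))
        cosines = np.clip(np.abs(np.diag(Ux.T @ Uy)), 0.0, 1.0)
        return float(np.arccos(cosines).sum())               # Equation 4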

3.1.3 Variance Component

We note that data sets can have different "shapes." For example, in two dimensions, a data set with little or no correlation between its attributes has a scatter plot that is circular in shape, while the points of a data set with maximum correlation all lie along the same line. It may be the case that the principal components of X and Y are completely aligned, but that the data sets still have very different shapes. For example, consider data sets C and E in Figure 1. It will be shown in Section 3.2 that the principal components of C and E are nearly aligned, but the shapes of their plots make it obvious that they have different variance structures: data set C has a short, oval shape, while E is much more elongated.

To account for these differences in the shapes of the data sets, we examine the difference in the distributions of the variance over the principal components of X and Y. More formally, consider the random variable $V_X$ having the probability mass function:

$P(V_X = i) = \dfrac{\lambda_{X_i}}{\mathrm{trace}(\Lambda_X)}$   (5)

where $\Lambda_X$ is the diagonal matrix of singular values from Equation 3, and $\lambda_{X_i}$ is the ith singular value. $P(V_X = i)$ is then the proportion of the variance in the direction of the ith principal component. We can then compare the distributions of $V_X$ and $V_Y$ by finding the symmetric relative entropy:

$SRE(V_X, V_Y) = \tfrac{1}{2}\bigl(H(V_X \| V_Y) + H(V_Y \| V_X)\bigr)$   (6)

where $H(X \| Y)$ is the relative entropy of two random variables X and Y. The relative entropy is a common measure of the distance between two probability distributions [8]. We can then define the variance dissimilarity as the symmetric relative entropy:

$D_v(X, Y) = SRE(V_X, V_Y).$   (7)
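The variance component can be sketched as follows (our own illustration; the eigenvalue proportions come from the same decomposition used for Dr, and the small smoothing constant is an implementation assumption to avoid log(0)):

    def variance_dissimilarity(X, Y, eps=1e-12):
        """D_v: symmetric relative entropy between the eigenvalue proportions."""
        lx = np.linalg.svd(np.cov(X, rowvar=False), compute_uv=False)
        ly = np.linalg.svd(np.cov(Y, rowvar=False), compute_uv=False)
        p, q = lx / lx.sum() + eps, ly / ly.sum() + eps      # Equation 5, smoothed
        kl = lambda a, b: float(np.sum(a * np.log(a / b)))   # relative entropy H(a||b)
        return 0.5 * (kl(p, q) + kl(q, p))                   # Equations 6 and 7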

3.1.4 Final Dissimilarity Metric

The dissimilarity between X and Y can now be defined in two different manners. Our basic formulation is given by:

$D_\Pi(X, Y) = D_d \times D_r \times D_v.$   (8)

A more flexible formulation is a linear combination of the components, given by:

$D_\Sigma(X, Y) = \beta_0 + \beta_d D_d + \beta_r D_r + \beta_v D_v.$   (9)

This formulation allows the components to be weighted differently (or completely ignored) by varying the values of their coefficients (i.e., the β's). To avoid an unwanted bias towards one or more of the components, the coefficients must be chosen to normalize their respective components. This is straightforward for some components (for example, Dr only takes on values in the range [0, nπ/2]), but not for others (for example, when using the Euclidean distance for Dd on non-normalized data).

Since the coefficients allow the components to be weighted differently, a user can bias the measure to reflect domain knowledge. For example, DΣ reduces to the basic Euclidean distance between the centroids of the data sets when βd is set to 1 and the others are set to 0. At the other extreme, one may be more concerned with finding data sets with similar covariance structures than with the relative locations of the data sets, and so βr and βv can be set to some positive value while βd is set to 0.
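Putting the three components together, a hedged sketch of Equations 8 and 9, reusing the component functions above (the β values passed in are placeholders chosen by the user, not values from the paper):

    def dissimilarity(X, Y, betas=None):
        """D_Pi (product form) when betas is None, otherwise D_Sigma (weighted sum)."""
        dd = centroid_distance(X, Y)
        dr = rotation_dissimilarity(X, Y)
        dv = variance_dissimilarity(X, Y)
        if betas is None:
            return dd * dr * dv                          # Equation 8
        b0, bd, br, bv = betas                           # hypothetical user-chosen weights
        return b0 + bd * dd + br * dr + bv * dv          # Equation 9

For instance, betas=(0.0, 0.0, 1.0, 1.0) ignores location entirely and compares only the covariance structure, as described above.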

3.1.5 Missing Data

Our measure is also robust to missing data. If a data set X has records with missing attribute values, and assuming that the data has a normal distribution, one can use the Expectation-Maximization algorithm [10] to find the maximum-likelihood values of the centroid µX and the covariance matrix cov(X). The principal components one finds are then the sample principal components [19], and one can develop confidence intervals to test their closeness to the true (population) principal components. If the missing data is not excessive, the maximum-likelihood/sample estimates of the components will be accurate, and the computation of the dissimilarity metric can continue as before.


Figure 1. A plot of five synthetic data sets.

        A        B        C        D        E
A       -     511.43     5.3    854.06   867.04
B    511.43      -     512.87   604.31   617.37
C      5.3    512.87      -     858.64   871.64
D    854.06   604.31   858.64      -      13.69
E    867.04   617.37   871.64    13.69      -

Table 1. Dissimilarity: distance component.

Other approaches for handling missing data involve simply ignoring records with missing data completely. In Section 4.5 we present results showing that simply ignoring missing data does not drastically affect the performance of our measure.

3.2 Example

In this example, we look at each component in turn to show how it influences the final value of the dissimilarity measure. Consider Figure 1, where we have plotted five different data sets labeled A through E. Each data set is similar to the others in different ways. For example, sets A and E have similar shapes, D and E have similar centroids, and B and D have similar slopes.

In Table 1 we present the pairwise distance dissimilarities of the data sets. The boldface values in the original table represent the minimal dissimilarities between data sets. As we expect, data sets A and C are considerably similar to each other according to this measure, as are data sets D and E, while data set B is considerably dissimilar from all the other data sets. Note that while data sets A and C have similar means, they have extremely different covariance structures that are not taken into account by this measure.

In Table 2 we present the pairwise rotation dissimilarities of the data sets.

       A      B      C      D      E
A      0     0.67   2.53   0.4    2.55
B     0.67    0     3.09   0.27   3.06
C     2.53   3.09    0     2.93   0.02
D     0.4    0.27   2.93    0     2.95
E     2.55   3.06   0.02   2.95    0

Table 2. Rotation dissimilarity.

         A          B        C        D        E
A        0         0.18     0.20     0.10     0.000009
B       0.18        0       0.0007   0.01     0.18
C       0.20       0.0007    0       0.01     0.21
D       0.096      0.009    0.014     0       0.099
E       0.000009   0.18     0.21     0.10      0

Table 3. Variance dissimilarity.

As we expect, data sets A, B, and D are very similar to each other, since their principal components point in nearly the same directions. We note that the most similar pair of data sets according to this measure is E and C, while according to the distance dissimilarity measure they are the most dissimilar pair of data sets.

In Table 3, we present the pairwise variance dissimilarities. In this case, data sets A and E are very similar to each other, which is expected, since the plots of each are both long and thin. We also note that while data sets B and C are very similar to each other according to the variance dissimilarity measure, they are the most dissimilar pair according to the rotation dissimilarity measure.

In Table 4, we present the total pairwise dissimilarity of the data sets. In this case we use the product form (DΠ) of our measure. We find that data sets A and E are the most similar, due to the high similarity of the distribution of their variances across their principal components. Data set E is next most similar to data set D due to the proximity of their means, and E is also quite similar to data set C, since their principal components are rotated similarly.

       A       B       C       D       E
A      0      60.07    2.73   32.94    0.02
B     60.07    0       1.14    1.38   341.14
C      2.73    1.14     0     36.07    4.52
D     32.94    1.38   36.07     0      3.99
E      0.02  341.14    4.52    3.99     0

Table 4. Total dissimilarity (DΠ).


procedure FindChangePoints(series T, int W1, int W2)
begin
    for each point t ∈ T
        Before   = the W1 points occurring before t
        After    = t ∪ the W2 − 1 points occurring after t
        Score[t] = D(Before, After)
    end
    Filter Score to find maxima
    Return the t corresponding to maxima of Score
end.

Figure 2. The change point detection algorithm.

E is most dissimilar to data set B, due to large differences in their respective means, rotations, and variances. However, a basic distance-based dissimilarity measure (for example, one using just Dd) would rank B as the second-most similar data set to E (after D), as can be seen from Table 1.

3.3 Applications

In this section we present an overview of how our dissimilarity measure can be used in several common data analysis techniques. The techniques we consider are change point detection, anomaly detection, and data set clustering.

3.3.1 Change Point Detection

One application of our dissimilarity measure is change point detection. In change point detection, one wants to find the point(s) in a time series where there has been an abrupt change in the process generating the series [5]. Our algorithm for off-line change point detection for multivariate time series is presented in Figure 2. It works by scanning over a time series T, comparing two successive windows of data points, the first of size W1 and the second of size W2, using our dissimilarity measure D. It returns the maxima of D applied over T. It follows that the maximum value of D is achieved when the two successive windows are most different with respect to their means, rotations, or variances, signaling that the underlying distribution generating the time series has changed between the two windows. We present the experimental results of running change point detection on stock market data in Section 4.2.
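A direct Python sketch of the procedure in Figure 2, reusing the dissimilarity function defined earlier (the simple local-maximum filter and the top_k parameter are illustrative assumptions, not details fixed by the paper):

    import numpy as np

    def find_change_points(T, W1, W2, top_k=5):
        """Score each position by D(before-window, after-window); return the best ones."""
        scores = {}
        for t in range(W1, len(T) - W2 + 1):
            before = T[t - W1:t]                 # the W1 points before t
            after = T[t:t + W2]                  # t and the W2 - 1 points after it
            scores[t] = dissimilarity(before, after)
        # keep local maxima only, then return the top_k highest-scoring positions
        maxima = [t for t in scores
                  if all(scores[t] >= scores.get(n, -np.inf) for n in (t - 1, t + 1))]
        return sorted(maxima, key=scores.get, reverse=True)[:top_k]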

3.3.2 Anomaly Detection

A problem closely related to change point detection is anomaly detection. Whereas change point detection seeks to discover points that mark a shift from one generating process to another, anomaly detection seeks to discover points that are outliers with regard to the current generating process. Outlier detection algorithms work by assigning an anomaly score to each point in a data set based on its dissimilarity to the other points in the set. The most dissimilar ones are marked as outliers. Since our measure is designed to measure the dissimilarity between a pair of data sets, we cannot directly measure the dissimilarity between a point and a data set. However, we can use our measure to assign an anomaly score to a point:

$S_X(x) = D(X, X - x).$   (10)

The anomaly score function $S_X(x)$ measures how much the mean and covariance structure of X would change if the data point x were removed. If the value of $S_X(x)$ is large, then x must be introducing considerable distortion into the model of X.
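A minimal sketch of this leave-one-out scoring (our own illustration; top_n is a hypothetical parameter):

    import numpy as np

    def top_outliers(X, top_n=15):
        """Rank points by how much removing them changes the data set model (Equation 10)."""
        scores = np.array([dissimilarity(X, np.delete(X, i, axis=0))   # D(X, X - x)
                           for i in range(len(X))])
        return np.argsort(scores)[::-1][:top_n]      # indices of the most anomalous points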

We demonstrate the utility of our dissimilarity measures for outlier detection using the above approach with a toy data set. We compare our measures to the basic Mahalanobis distance metric, since it also incorporates information concerning the covariance structure of the data set (similar to the formulation in Equation 10, we calculate the distance from a point x to the centroid of X − x using the covariance matrix of X − x):

$S_X(x) = (\mu_{X-x} - x)\,\Sigma_{X-x}^{-1}\,(\mu_{X-x} - x)^T.$   (11)

Our data set contains 150 points, and we find the top 15 outliers according to each measure. The results can be seen in Figure 3. In these plots, normal points are denoted by pluses and outliers are denoted by stars. In Figure 3(A) we show the outliers discovered using the Mahalanobis distance metric. In Figures 3(B) and (C) we show the outliers discovered using our DΠ and DΣ measures, respectively (in the case of DΣ, we have chosen the β's so that the components are normalized). As we expect, the results are quite similar, since all three take into account both the means and the covariances of the data. However, unlike the Mahalanobis distance metric and the DΠ measure, the DΣ measure is much more flexible, as the user is able to choose the values of the β's. This flexibility is demonstrated in Figures 3(D)-(F), where we detect outliers using only the distance, rotation, and variance components, respectively, by setting the coefficient of the relevant component to 1 and the others to 0. In each case, different outliers are found.


Figure 3. Outliers in a data set discovered using different measures: (A) Mahalanobis; (B) DΠ; (C) DΣ; (D) DΣ (Euclidean) distance component only; (E) DΣ rotation component only; (F) DΣ variance component only.

For example, using the distance component only (Figure 3(D)), the outliers are those points on the extreme ends of the "arms" of the data set, whereas when we use the rotation component only (Figure 3(E)), the points not belonging to any of the "arms" are marked as outliers.

One can also use an alternative, incremental form of anomaly detection that is applicable in domains where data sets are streaming or in the form of time series. In this form, one calculates D(X, X ∪ x), where X is a sliding window of k data points and x is the first data point following the window. This is similar to change point detection, except that it only uses information that arrives prior to x. We present experimental results of using this approach with our dissimilarity measure on stock market data in Section 4.3.
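A short sketch of this incremental variant (the window length k is a user choice; the function reuses the dissimilarity sketch above):

    def streaming_scores(T, k=15):
        """Score each new point against the sliding window of the k points before it."""
        return [dissimilarity(T[t - k:t], T[t - k:t + 1])   # D(X, X ∪ x)
                for t in range(k, len(T))]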

3.3.3 Data Set Clustering

One of the advantages of a dissimilarity measure for data sets is that it allows one to cluster the data sets into groups with similar means or variances, depending on how one weights the components. As a motivating example, consider a large business organization such as Wal-Mart, with national or international interests. Such organizations usually rely on a homogeneous distributed database to store their transaction data, which leads to time-varying, distributed data sources. In order to analyze such a collection of databases, it is important to cluster them into a small number of groups, so that global trends can be contrasted with local trends and advertising campaigns can be targeted at specific clusters.

It is straightforward to perform agglomerative hierarchical clustering of data sets using our dissimilarity measure. If one has n data sets, one can construct an n-by-n table containing the pairwise dissimilarities of the data sets. Once this table has been constructed, one can use any distance metric (e.g., single-link or complete-link) to perform the hierarchical clustering (a sketch using this table is given at the end of this subsection). We present experimental results on using hierarchical clustering for stock market data in Section 4.4. This table also facilitates non-hierarchical clustering approaches, such as the k-medoid approach [14].


Figure 4. Plot of stock indices centered on the change point.

The k-medoid approach works by selecting several data sets at random to be medoids of the clusters, assigning each remaining data set to the cluster with the most similar medoid, and then checking whether replacing any of the medoids with other data sets would reduce the dissimilarity within their respective clusters. If so, the process repeats until no medoids are replaced or some other stopping criterion is met.
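With the pairwise dissimilarity table in hand, standard clustering libraries can be applied directly. A hedged sketch using SciPy's hierarchical clustering (single-link, as used in Section 4.4; `datasets` is a hypothetical list of the data sets to be clustered, and `dissimilarity` is the sketch from Section 3.1):

    import numpy as np
    from scipy.cluster.hierarchy import linkage, dendrogram
    from scipy.spatial.distance import squareform

    def cluster_data_sets(datasets):
        """Agglomerative clustering of whole data sets from their pairwise dissimilarities."""
        n = len(datasets)
        table = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                table[i, j] = table[j, i] = dissimilarity(datasets[i], datasets[j])
        Z = linkage(squareform(table), method='single')   # single-link on the n-by-n table
        return Z                                          # pass Z to dendrogram(Z) to plot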

4 Experimental Results

4.1 Setup

In our experiments we utilize historical stock market data available from Yahoo! Finance [1]. We constructed several multivariate time series data sets, where each dimension is the adjusted closing price of a stock or stock index. The stock indices that we use are the Dow Jones (DJ), Standard and Poor's 500 (S&P 500), and the 10-year Treasury Note (TN) indices from January 1962 until May 2005. We also used the stock prices of a set of six pharmaceutical companies (Abbott Laboratories, GlaxoSmithKline, Johnson and Johnson (J&J), Merck, Pfizer, and Wyeth) from August 1986 until June 2005. All of our implementations are done using Octave, an open-source version of Matlab.

4.2 Change Point Detection

In our first set of experiments, we examined our measure's effectiveness when used for change point detection. One of our more impressive results comes from a bivariate data set containing the values of the Dow Jones and S&P 500 indices. In this experiment we normalized the data and set the window sizes W1 and W2 both equal to 100. We derived the principal components from the covariance matrix of the data. We eliminated the scores of the first and last months in order to avoid edge effects.


Figure 5. Scatter plot of stock indices.

The highest-scoring change point according to our dissimilarity metric (DΠ) occurred on February 28, 2001. In Figure 4 we plot these two indices versus time, showing 100 points on either side of the change point. From this figure we can see that the indices become more highly correlated after the change point. The difference is more obvious in Figure 5. Here the values of the indices are plotted against each other, and the markers indicate whether the point comes from before or after the change point. As can be seen, the points fall into two distinct clusters depending on whether they come before or after the change point. We note that when we perform SVD using the correlation matrices instead of the covariance matrices of the data, the results are very similar, though the change points may be shifted by a few instances. For the example above, when we use the correlation matrix, we calculate the change point as February 23, 2001.

4.3 Anomaly Detection

We test our incremental outlier detection algorithm on several data sets: Indices, which contains all three stock indices (DJ, S&P 500, and TN); DJ/S&P 500 and DJ/TN, which contain only the two relevant indices; Pharm., which contains all six pharmaceutical stocks (see Section 4.1); and Pfizer/Merck, which contains only the two relevant stocks. In our experiment we use DΠ, performing SVD on the covariance matrices, vary the value of k (the size of the sliding window) over 12 different values (4, 5, 6, 8, 10, 15, 20, 30, 40, 60, 80, 100), and mark the dates of the top 30 outliers for each value of k, creating 12 lists of 30 dates each.

In Figure 6 we plot Pfizer's and Merck's stock prices during the year 2004. The vertical lines mark the days of the outliers.


Date       Description             Indices      DJ/S&P 500   DJ/TN        Pharm.       Pfizer/Merck
10-19-87   Market crash            92% (100%)   25% (58%)    92% (92%)    17% (75%)    25% (83%)
3-16-00    Largest DJ increase      0% (0%)     50% (0%)      8% (0%)     17% (0%)      0% (0%)
4-14-00    Largest DJ decrease     33% (8%)     58% (25%)    50% (17%)     0% (0%)      0% (0%)
9-17-01    WTC attack               8% (58%)    25% (67%)    42% (33%)     0% (0%)      0% (0%)
9-30-04    Vioxx™ warning           0% (0%)      0% (0%)      0% (0%)     58% (100%)   92% (100%)
12-17-04   Celebrex™ warning        0% (0%)      0% (0%)      0% (0%)      8% (0%)     75% (33%)

Table 5. Detection rates of notable outliers using the DΠ measure and a Mahalanobis metric-based DΣ measure (in parentheses).


Figure 6. Outliers in 2004 Merck and Pfizer stock prices.


Figure 7. The Pfizer/Merck March 11 outlier and preceding days.

2004 was a tumultuous year for these stocks, as it contains six of the top 30 outliers according to our measure when we set k equal to 15. Note that our measure is able to detect large changes in the means, as is the case for the September 30 and December 17 outliers, as well as more subtle outliers, such as the one occurring on March 11. In Figure 7 we show a scatter plot of Pfizer and Merck stock prices for March 11 and the 15 trading days preceding it. This clarifies why March 11 is marked as an outlier: in the 15 trading days prior to March 11, Pfizer's and Merck's stock prices were relatively uncorrelated, but on March 11 the prices of both sank sharply.

We also verify that our dissimilarity measures can detect known anomalies. To do this we search the 12 lists of outliers for several well-known dates. For example, we pick October 19, 1987, since the stock market suffered a major crash on that day, and March 16, 2000, as that day currently holds the record for the largest increase of the Dow Jones index. We also pick September 30, 2004 and December 17, 2004, as those are the days when information was announced concerning serious side effects of Merck's Vioxx™ and Pfizer's Celebrex™ drugs, respectively.¹ We compare the basic DΠ measure against the DΣ measure that uses the Mahalanobis metric for the distance component (see Equation 2).

In Table 5 we present our results as the percentage of the lists in which each date appeared, for all of the data sets, for both the DΠ measure and the Mahalanobis-based DΣ measure (given in parentheses). The measures discern these anomalous days fairly well. The first four rows indicate anomalous days for the overall stock market, and the anomalies are reflected in the market index data sets. The last two rows indicate anomalous days for the pharmaceutical sector, as the announcements concerning Vioxx™ and Celebrex™ had adverse effects on the prices of Merck's and Pfizer's stocks, respectively.

¹Vioxx is a trademark of Merck and Company, Incorporated. Celebrex is a trademark of Pfizer, Incorporated.


Figure 8. Pfizer/Merck 2004 stock prices.

However, the effect of the announcements was not as great on the values of the market indices.

We note that in some cases the Mahalanobis-based DΣ approach outperforms the basic DΠ measure, while in other cases the DΠ measure outperforms the Mahalanobis-based approach. For example, the Mahalanobis-based approach more consistently detects the October 19, 1987 market crash and the September 30, 2004 Vioxx™ announcement, while the DΠ measure more consistently detects the largest Dow Jones increase (March 16, 2000) and decrease (April 14, 2000). The reason lies in the manner in which the two approaches detect anomalies. The Mahalanobis-based approach is biased towards the Mahalanobis distance from the stock prices on the current day to the mean of the prices on the previous k days. Therefore, it is good at detecting large changes in the mean. The DΠ measure, however, also detects changes in the correlations. The Dow Jones anomalies involve a large change in the mean value of the Dow Jones index, but this is not as drastic as the change in correlation that results when it is paired with other indices that do not have such large changes in mean value.

4.4 Data Set Clustering

In our next experiment, we examine the effects of using our dissimilarity measure to perform agglomerative hierarchical clustering. Note that, as mentioned earlier, one can also use k-medoids clustering here. We use the Pfizer/Merck data set and extract the records for each month during the year 2004 to form 12 separate data sets (see Figure 8).

Figure 9. Dendrogram resulting from clustering of the monthly data sets.

Our original intent was to demonstrate the effectiveness of this approach by clustering clinical trial patient data from Pfizer, Incorporated, and to use it as a mechanism for detecting hepatotoxicity signals. However, due to delays in getting permission to publish results from this data, we are unable to include these results at this time.

We build a table of pairwise dissimilarities between the monthly Pfizer/Merck data sets using the DΣ measure, with a slight bias towards the distance component to account for the drop in stock prices in the latter part of the year. Using this table, we perform hierarchical clustering using the single-link distance metric. The dendrogram resulting from the clustering can be seen in Figure 9. The results are as expected: the data sets for January through June are clustered early, as they have similar means and positive correlations, and the data sets for October through December are not clustered together until very near the end, due to their large differences in means compared to the other months. We see that October and December cluster with each other first, which is notable since they are the months most influenced by the Vioxx™ and Celebrex™ announcements, respectively.

4.5 Robustness to Missing Data

Finally, we examine the effects of ignoring records containing missing data. In this experiment we used the DJ/S&P 500 data set, and progressively removed 1%, 5%, 10%, 15%, and 20% of the records (i.e., the data set with 20% of the records removed is a subset of the data set with 15% of the records removed, and so on). For each of these data sets, we calculated the top 20 change points using the algorithm in Section 3.3.1, with W equal to 15, 40, and 100.


Figure 10. Percentage of change points found for differing degrees of missing data.

We then compared each set of 20 change points to the top 20 change points found when there is no missing data. We counted the number of change points that matched to within some time window (on the order of one week for W equal to 15, and one month for W equal to 100). The percentage of correct matches for each data set and each value of W is presented in Figure 10. Our detection rates range from a high of 100% (all change points found) for 1% missing data to a low of 70% when 15% to 20% of the data is missing. Note that in this experiment we assume we do not know which records are missing: we calculate the change point based on the W non-missing records coming before the point, and the W non-missing records coming after it. Therefore, the change point is calculated using only a subset of the records that would be used if there were no missing data, plus some "extra" records that would not be considered if there were no missing data. This fact, coupled with our detection rates, indicates that our approach is fairly robust to missing data.

5 Conclusion

In this paper we presented a dissimilarity measure for data sets that takes into account the means and correlation structures of the data sets. This dissimilarity measure is tunable, allowing the user to adjust its parameters based on domain knowledge. The measure has many different applications for time series analysis, including change point detection, anomaly detection, and clustering, and our experimental results on time series data sets show its effectiveness in these areas. In future work we want to use our dissimilarity measure to detect anomalous data sets. This is applicable to clinical trial data, where patients are represented by multivariate time series of blood analyte values, and detection of anomalous patients can lead to early discovery of possibly serious side effects of the drug being tested. We have conducted experiments on this, and our results are extremely promising; however, at this point we do not have permission to share them. We also plan to explore the incremental aspects of our measure in order to apply it to dynamic and streaming data sets. For example, the computational costs of calculating our dissimilarity measure on dynamic or streaming data can be reduced by using incremental PCA techniques [4, 13]. Such incremental techniques can also enhance execution speeds when performing anomaly detection and change point detection off-line, where sliding windows are used to scan the data.

References

[1] Yahoo! Finance. http://finance.yahoo.com.
[2] C. C. Aggarwal. Towards systematic design of distance functions for data mining applications. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 9–19, August 2003.
[3] R. Agrawal, C. Faloutsos, and A. N. Swami. Efficient similarity search in sequence databases. In FODO '93: Proceedings of the 4th International Conference on Foundations of Data Organization and Algorithms, pages 69–84. Springer-Verlag, 1993.
[4] M. Artac, M. Jogan, and A. Leonardis. Incremental PCA for on-line visual learning and recognition. In ICPR, 2002.
[5] M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice Hall, 1993.
[6] S. D. Bay and M. J. Pazzani. Detecting change in categorical data: Mining contrast sets. In Knowledge Discovery and Data Mining, pages 302–306, 1999.
[7] C. Bohm, K. Kailing, P. Kroger, and A. Zimek. Computing clusters of correlation connected objects. In Proceedings of the 2004 ACM SIGMOD International Conference on Management of Data, pages 455–466. ACM Press, 2004.
[8] T. M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.
[9] G. Das, H. Mannila, and P. Ronkainen. Similarity of attributes by external probes. In Knowledge Discovery and Data Mining, pages 23–29, 1998.
[10] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B39:1–38, 1977.


[11] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases with noise. In E. Simoudis, J. Han, and U. Fayyad, editors, Second International Conference on Knowledge Discovery and Data Mining, pages 226–231, Portland, Oregon, 1996. AAAI Press.
[12] R. Goldman, N. Shivakumar, S. Venkatasubramanian, and H. Garcia-Molina. Proximity search in databases. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB), pages 26–37, 1998.
[13] P. M. Hall, D. Marshall, and R. R. Martin. Incremental eigenanalysis for classification. In BMVC, May 1998.
[14] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[15] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[16] D. P. Huttenlocher, G. A. Klanderman, and W. A. Rucklidge. Comparing images using the Hausdorff distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(9):850–863, 1993.
[17] H. V. Jagadish, A. O. Mendelzon, and T. Milo. Similarity-based queries. In PODS '95: Proceedings of the Fourteenth ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, pages 36–45, New York, NY, USA, 1995. ACM Press.
[18] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
[19] R. A. Johnson and D. W. Wichern. Applied Multivariate Statistical Analysis. Prentice Hall, fifth edition, 2002.
[20] I. T. Jolliffe. Principal Component Analysis. Springer-Verlag, 1986.
[21] S. Parthasarathy and M. Ogihara. Clustering distributed homogeneous datasets. In PKDD '00: Proceedings of the 4th European Conference on Principles of Data Mining and Knowledge Discovery, pages 566–574, London, UK, 2000. Springer-Verlag.
[22] J. C. Principe, N. R. Euliano, and W. C. Lefebvre. Neural and Adaptive Systems: Fundamentals through Simulations. John Wiley and Sons, 2000.
[23] R. Reyment and K. G. Joreskog. Applied Factor Analysis in the Natural Sciences. Cambridge University Press, 1996.
[24] R. Subramonian. Defining diff as a data mining primitive. In KDD 1998, 1998.
[25] J. Yang, W. Wang, H. Wang, and P. Yu. δ-clusters: Capturing subspace correlation in a large data set. In Proceedings of the 18th International Conference on Data Engineering (ICDE'02), page 517. IEEE Computer Society, 2002.
[26] K. Yang and C. Shahabi. A PCA-based similarity measure for multivariate time series. In MMDB '04: Proceedings of the 2nd ACM International Workshop on Multimedia Databases, pages 65–74, New York, NY, USA, 2004. ACM Press.


Temporal Data Mining Based on Temporal Abstractions

Robert Moskovitch and Yuval Shahar Medical Informatics Research Center

Department of Information Systems Engineering Ben Gurion University, P.O.B. 653, Beer Sheva 84105, Israel

robertmo,[email protected]

Abstract

Analyzing data for support of diagnostic tasks in dynamic domains, such as medicine, plant pathology, or information and communication technology security, requires an explicit representation and consideration of the temporal semantics of the data. However, discovering temporal knowledge is a challenging task. Temporal abstraction is a common task, based on temporal reasoning, which provides an intelligent interpretation and summary of large amounts of raw data. We suggest the application of temporal data mining mainly to time intervals of temporally abstracted data, instead of to only time-stamped raw data, and discuss its potential benefits.

1. Introduction

The analysis of the large amounts of data collected over time, and the discovery of new knowledge from them, presents significant computational challenges. Much progress has been achieved in the area of 'static' data mining; however, there is still much room for further research regarding its extension to temporal data mining, in which the temporal dimension is represented and reasoned about explicitly.

Most of the work on temporal data mining has been on computational methods applied to raw time-oriented data. Higher-order mining, in which mining is applied to previously mined rules, is an area that has received little attention, but that holds the promise of reducing the overhead of data mining, as discussed in a recent survey [5]. However, the survey's authors have also pointed out that care needs to be taken in such an automatic process.

We propose to exploit results from the work on the temporal abstraction task, mostly performed within the temporal reasoning community, as a preprocessing stage prior to the application of temporal data mining techniques. Temporal abstraction strives to summarize large amounts of time-oriented data using significant domain-specific knowledge. Using temporal abstractions may potentially reduce the amount of data and noise, while providing results that are in the domain expert's terms and that might be more valid.

The intuition behind this approach is similar to the one behind higher-order mining, but the mining here is applied to data that are more meaningful to the domain experts; by performing the abstraction, we have already exploited the domain experts' knowledge and learned from it. Thus, we term our approach intelligent temporal data mining (ITDM).

In this position paper, we briefly introduce the task of temporal data mining and the need for temporal abstractions. Then we introduce the Knowledge-Based Temporal Abstraction (KBTA) method as the proposed temporal abstraction mechanism within the ITDM framework. We suggest how the ITDM framework can potentially improve the task of temporal knowledge discovery and how, especially when using a KBTA-like abstraction method, it might iteratively extend the knowledge base. Finally, we discuss our future work.

2. Temporal Data Mining

Temporal Data Mining (TDM) can be defined as the activity of looking for interesting correlations or patterns in large temporal datasets. TDM has evolved from data mining and was highly influenced by the areas of temporal databases and temporal reasoning.

Several surveys on temporal knowledge discovery exist [5]. Most TDM techniques convert the temporal data into static representations and exploit existing 'static' machine learning techniques, thus potentially missing some of the temporal semantics. Recently there has been growing interest in the development of temporal data mining techniques in which the temporal dimension is considered more explicitly. Console et al. proposed an extension of the well-known decision tree induction algorithm to the temporal dimension [1]. One advantage of temporal decision trees is that the output of the induction algorithm is a tree that can immediately be used for pattern recognition purposes. However, the method can only be applied to time points, not to time intervals.

3. The Need for Temporal Abstraction

Abstractions of time-oriented raw data are called temporal abstractions (TA); producing them is a task that usually relies on temporal reasoning techniques [2].


Temporal data abstraction has attracted considerable research interest as a fundamental intermediate reasoning process for the intelligent interpretation of temporal data in support of tasks such as diagnosis and monitoring, and is crucial in the medical domain [4]. Background knowledge, commonly acquired from experts (e.g., classification tables, association rules, causal models), is "matched" against the time-oriented data records (e.g., time-stamped patient data). The result is a set of concepts at a higher level of abstraction than the raw data, interpreted over time intervals rather than only time points.

Lavrac et al. discussed the need for temporal abstraction when performing intelligent data analysis [4], as a preprocessing method prior to applying machine learning; however, they did not refer to TDM and no implementation was shown. They discussed the use of temporal abstraction in the medical domain within tasks such as diagnosis and prognosis determination, and several TA approaches in the medical domain were compared. The authors highlight the advantages of the KBTA approach [6] due to its reusability and the generality of its temporal abstraction mechanisms. We will also focus on the KBTA method, due to the advantages we see in using it for the ITDM approach.

4. Knowledge-based Temporal Abstraction

Knowledge-based Temporal Abstraction (KBTA) is a problem-solving method, based on artificial intelligence techniques, developed by Shahar [6]. Originally developed within the medical domain, the framework has since been used in multiple other domains, such as traffic control [7]; we will use examples mainly from the medical domain. KBTA infers domain-specific interval-based abstractions from point-based raw data. A very simple example is that the input might include a set of time-stamped hemoglobin measurements, while the output includes an episode of moderate anemia during the past 6 weeks, based on domain-specific knowledge stored in a formal knowledge base. KBTA uses knowledge of [temporal] interpretation contexts to adjust its conclusions; contexts are generated dynamically from the data. For example, the method might use a different definition of what constitutes moderate anemia in women as opposed to in men, or in men within the first 3 weeks after taking a particular medication.

Given an input of raw measured data (parameters) and external interventions (events), there are four primary output classes of abstractions generated by the KBTA method. A state abstraction takes as input one or more values and generates as output the value of the corresponding condition (e.g., high or low temperature). A gradient abstraction defines an interval during which the value of a parameter is changing (e.g., increasing or

decreasing hemoglobin values). Rate abstractions summarize the rate of change, such as rapidly changing blood pressure. The final output type is a pattern abstraction, either linear (one-time) or periodic (repeating), based on a set of time and value constraints. Intervals are interpolated from time points and shorter intervals using a temporal-interpolation model. Generating abstractions requires domain-specific knowledge, stored in the temporal-abstraction knowledge base. The knowledge contained in the knowledge base, such as the state-abstraction classification tables, the interpolation tables, or the temporal relationship between an event and the contexts it generates, is defined using the temporal-abstraction ontology (an ontology is a model of domain concepts, their properties, and the relations among them) [6].
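To make the state-abstraction step concrete, the following small Python sketch (our own illustration, not part of the KBTA implementation; the classification thresholds and gap parameter are hypothetical stand-ins for the domain knowledge) turns a time-stamped series of values into maximal intervals labeled by a classification table:

    def state_abstraction(samples, classify, max_gap):
        """samples: list of (timestamp, value); returns (start, end, label) intervals.

        Adjacent samples with the same label are joined into one interval as long as
        the gap between them does not exceed max_gap (a stand-in for the domain's
        interpolation knowledge)."""
        intervals = []
        for t, v in samples:
            label = classify(v)
            if intervals and intervals[-1][2] == label and t - intervals[-1][1] <= max_gap:
                intervals[-1] = (intervals[-1][0], t, label)   # extend the current interval
            else:
                intervals.append((t, t, label))                # open a new interval
        return intervals

    # Hypothetical classification table for hemoglobin (g/dL) in a given context:
    hgb_state = lambda v: 'low' if v < 12 else ('normal' if v <= 16 else 'high')

For example, state_abstraction([(0, 10.5), (2, 11.0), (9, 13.2)], hgb_state, max_gap=3) yields one 'low' interval covering days 0-2 and one 'normal' interval at day 9.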


Figure 1: Temporal abstraction of a single patient's data in an oncology domain. Raw data are plotted at the bottom; contexts and abstractions computed from the data are plotted as intervals above them.

5. Using KBTA for Temporal Data Mining

Typically, when a new temporal data set is explored, there exists some prior domain knowledge that is known by the domain expert. This knowledge can be represented and used to interpret the raw data. Basic types of temporal knowledge (states, gradients, and rates), based on the domain expert's previous experience, as well as simple temporal patterns well known to the expert, can be exploited easily within the KBTA framework, whose output will be a set of domain-specific abstractions interpreted over time intervals (i.e., interpretations of the raw data). Discovering temporal patterns from knowledge-based abstracted time intervals instead of from time-stamped raw data has several potential advantages. These include less noisy data, an effect of the state-abstraction and pattern-matching mechanisms; fewer missing values, due to the interpolation mechanism; and, typically, a smaller data set (considering only maximal-length abstract data types). The resulting abstractions are also much more meaningful to a domain expert.


On the other hand, temporal patterns, especially complex ones, are hard to acquire from a domain expert. However, repeating patterns of temporal abstractions, even rather complex ones, might well be detected automatically given existing abstract components, and can either lead automatically to the addition of new patterns to the knowledge base or, when shown to the expert, can lead her to define new patterns. TDM can thus contribute to the discovery and acquisition of new temporal patterns, and thus to the extension of the domain-specific temporal-abstraction knowledge, as defined in the KBTA framework. Thus, the ITDM process is an iterative one (Figure 2).

Figure 2: The iterative process of intelligent temporal data mining, which includes feeding newly discovered temporal patterns into the temporal-abstraction (TA) knowledge base (KB).

Figure 2 illustrates the process of acquiring knowledge by using TDM in tandem with the KBTA framework. Initially, the domain expert's temporal-abstraction knowledge, such as states, gradients, rates, and temporal patterns (if available), is acquired into the knowledge base. KBTA computes temporal abstractions, which are stored in the database. Then, the ITDM engine is called to perform a specific task. Examples include learning a certain class of temporal association rules in a supervised fashion, or detecting, in a self-organizational fashion, a set of temporal pathways along which the records (e.g., diabetes patients) can be clustered (not necessarily supplying any labels). Each newly discovered temporal rule is presented to the expert, which might lead to a new temporal-abstraction pattern being stored in the temporal-abstraction knowledge base. Thus, potentially, a new temporal pattern using the pattern just added to the knowledge base might be discovered.

6. Discussion

We presented the ITDM process and the potential benefits of applying TDM to more stable, domain-specific abstractions, at multiple levels of abstraction, interpreted over time intervals, as opposed to its application to raw data interpreted over time points.

Based on previous experience [4, 6, 7], we suggest the KBTA framework for the temporal abstraction task. We have shown how TDM might potentially contribute to the expansion of the temporal knowledge base used by the KBTA method, and thus, in principle, iteratively expand the set of discovered temporal patterns and temporal rules.

A TDM technique is required that can analyze time intervals and that results in an explicit symbolic temporal pattern representation. Kam and Fu [3] suggested an algorithm for discovering temporal patterns from time intervals. However, their approach discovers all the patterns within the dataset in an unsupervised fashion and cannot be applied in a supervised fashion, in which a temporal pattern related to a specific concept is searched for. Console et al. [1] proposed the induction of temporal decision trees, which is relevant to our work but has to be extended to induce from observations and time intervals as input data. Note that there is a need both for a method for learning new temporal patterns and for an efficient method for recognizing (or detecting) known patterns given the data and a set of rules. We are currently exploring several algorithmic options for performing both tasks, including the investigation of the optimal abstraction level(s) to which to apply the ITDM process.

7. References

[1] L. Console, C. Picardi, and D. T. Dupré. Temporal decision trees: Model-based diagnosis of dynamic systems on-board. Journal of Artificial Intelligence Research, 19:469–512, 2003.

[2] M. Fisher, D. Gabbay, and L. Vila. Handbook of Temporal Reasoning in Artificial Intelligence. Elsevier, 2005.

[3] P. Kam and A. W. Fu. Discovering temporal patterns for interval-based events. In Proceedings of the 2nd International Conference on Data Warehousing and Knowledge Discovery, 2000.

[4] N. Lavrač, I. Kononenko, E. Keravnou, M. Kukar, and B. Zupan. Intelligent data analysis for medical diagnosis: using machine learning and temporal abstraction. AI Communications, 11:191–218, 1999.

[5] J. F. Roddick and M. Spiliopoulou. A survey of temporal knowledge discovery paradigms and methods. IEEE Transactions on Knowledge and Data Engineering, 14(4), 2002.

[6] Y. Shahar. A framework for knowledge-based temporal abstraction. Artificial Intelligence, 90(1-2):79–133, 1997.

[7] Y. Shahar and M. Molina. Knowledge-based spatiotemporal linear abstraction. Pattern Analysis and Applications, 1(2):91–104, 1998.


Incremental Maintenance of Wavelet Synopses for Data Streams

Ken Hao Liu

Department of Electrical Engineering

National Taiwan University

Taipei, Taiwan

[email protected]

Wei Guang Teng

Department of Engineering Science

National Cheng Kung University

Tainan, Taiwan

[email protected]

Ming Syan Chen

Department of Electrical Engineering

National Taiwan University

Taipei, Taiwan

[email protected]

Abstract

Due to the dynamic nature of data streams, a sliding window is used to track the most recent data within the retrospective period. To further satisfy resource constraints, only data summaries instead of all historical data are maintained to answer queries or to discover patterns. In this paper, we exploit the properties of the Haar wavelet transform, a method for data reduction, to generate synopses, and develop a novel technique to incrementally maintain the synopses over consecutive time windows. Our technique directly operates on the synopses in the transformed time-frequency domain without the need to store or to reconstruct the detailed contents. The required resources of both computing power and buffer space can thus be greatly reduced. Furthermore, the synopses can be used to answer various kinds of temporal queries and to provide L2-norm errors as the quality indicator when tracking data streams such as measurements collected in sensor networks.

1 Introduction

There are many emerging applications where data is collected in the form of continuous data streams, as opposed to finite stored databases. For example, in sensor networks, the numerical values that are continuously generated by a sensor node form a massive unbounded sequence of data. However, not all historical data records, but only those whose time tags are within the retrospective period, can be stored for possible queries. Consequently, a simple but effective data model of a sliding window is often used to reflect the effect that obsolete records falling out of the retrospective period are discarded when new ones are collected as time advances. However, computation resources may be prohibitively constrained for the length of retrospective period required in practical applications. Consequently, proper maintenance or tracking mechanisms for historical data should be carefully designed to answer queries or discover patterns. In general, the following requirements of data streams should be considered. First, each data point should be examined at most once when analyzing the data stream. Second, the storage space for maintaining data synopses should be bounded. Third, the newly arriving data points should be processed as fast as possible to accomplish real-time computing, i.e., the processing rate should be at least the same as the data arrival rate. Finally, the up-to-date analysis results of a data stream should be instantly available when requested.

With limited resources in data stream processing systems, approximation and adaptability are recognized as important issues. With proper techniques to generate synopses, the retrospective period for synopses can be significantly prolonged even if the available buffer space is kept fixed. In prior works [5, 8, 9, 10, 14, 15, 16], several data transformation and compression techniques have been developed to extract features of numerical time series; these are good candidates for our purpose of saving precious resources. As time advances, since there are data insertions and deletions due to the time window constraints, we can hardly afford the cost of performing the transformation or the compression repetitively to obtain the updated synopses. In addition, the reconstruction from the synopsis should be avoided unless a query is to be processed, since this operation not only is time-consuming but also requires a large buffer space to hold the reconstructed contents. Generally speaking, both the space and time constraints inherent in such data stream applications should be satisfied when maintaining the synopses over consecutive time windows.

We mention in passing some related works on the computation of approximate statistics over time windows. In [13], algorithms are proposed based upon probabilistic counting and sampling to maintain wavelet-based histograms. However, they deal with the problem of monitoring a fixed number of variable values rather than that of continuous numerical streams. Consequently, the algorithms cannot solve our problem with any direct extension.


In [7], wavelet-based approximations are used to generate sketches of the data stream to answer aggregate queries. In [2], various deterministic and randomized algorithms for maintaining approximate counts and quantiles over a stream of sliding windows using limited space are proposed. In [3], techniques to maintain variance with a continually updated estimate of the variance of the last values in a data stream are presented. In [6], a scheme to maintain the sum of the last integers with matching lower and upper bounds is developed. In [4], the SWAT structure is proposed to adapt to online updates. However, since the thresholding strategy of SWAT is not adaptive to the data distribution, the reconstruction error of the SWAT structure tends to grow exponentially as time advances. Moreover, there is no direct way to gradually retire old data elements from the SWAT structure.

In this paper, we focus on the data structure and the techniques to directly and incrementally maintain the synopsis generated by the Haar wavelet transform over consecutive sliding windows. A corresponding scheme which uses bounded space and time is devised. Our approach directly operates on the wavelet coefficients without the need to reconstruct the content of the current window when performing incremental updates. We propose a data structure named Sliding Dual Tree, abbreviated as SDT, and an algorithm, Direct Cyclic Incremental Forwarding, abbreviated as DCIF, to incrementally maintain the wavelet-based synopsis. Moreover, we also propose an algorithm, Direct Cyclic Incremental Rewinding, abbreviated as DCIR, to reconstruct the content of the most recent time window in response to user queries on the wavelet-based synopsis. Specifically, algorithm DCIF incrementally maintains the SDT structure by exploiting the locality and the multi-resolution properties of the wavelet transform, while algorithm DCIR reconstructs the contents from the maintained synopsis. The quality of the synopsis in terms of the L2 norm is also incrementally tracked. The proposed algorithms are both time- and space-efficient, so as to meet the resource constraints in data stream applications such as tracking and monitoring sensor readings in sensor networks.

The rest of the paper is organized as follows. The preliminaries of data streams and data processing techniques are explored in Section 2. The proposed algorithm DCIF for incremental maintenance of wavelet-based synopses and the proposed algorithm DCIR for query answering are described in Section 3. Empirical studies are conducted in Section 4. This paper concludes with Section 5.

2 Preliminaries

The data models and data reduction techniques commonly used in data stream applications are introduced in Section 2.1 and Section 2.2, respectively. Section 2.3 describes the problem of online updates of wavelet synopses.

2.1 Data Models and Query Processing

A sensor network deployed for environmental monitoring is a common scenario of a data stream processing system, where sensors are switched on to collect measurements at distinct points in time and are switched off the rest of the time. This periodic scenario, which provides snapshots at regular intervals, is mainly for the purpose of minimizing energy consumption. For simplicity, it is further assumed that the time interval of measurements for each sensor is identical and the clocks of all sensors are synchronized. Consequently, the time tag for obtained measurements can be made discrete and is denoted as t = 1, 2, .... In addition, the measurement obtained by a sensor node at time t can thus be represented by x(t). Note that x(t) is usually a numerical value in practical applications, e.g., x(t) can be the reading of sensed temperature or atmospheric pressure.

For many applications, we are more interested in the recent elements of a data stream due to the dynamic nature of the underlying data. Specifically, the concept of a retrospective period W is introduced, within which the historical values of the past W time intervals can be queried. Note that the retrospective period may vary from one data stream to another. Also, the range of feasible retrospective periods may vary with the available amount of resources in the data stream processing system. Consequently, a simple data model of a sliding window can thus be employed to clearly represent the effect of the new measurements as they are collected over time. Namely, a sliding window covers the most recent W elements, where the retrospective period W is taken as the size of the sliding window.
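As a minimal illustration of this data model, the sketch below (hypothetical values, assuming W = 8; not part of the original paper) keeps only the most recent W measurements of a single sensor as new readings arrive.

from collections import deque

W = 8                                   # retrospective period / sliding window size
window = deque(maxlen=W)                # old measurements fall out automatically
for t, reading in enumerate([64, 48, 16, 32, 56, 56, 48, 24, 50, 50], start=1):
    window.append(reading)              # measurement x(t) collected at time tag t
print(list(window))                     # the W most recent elements: [16, 32, 56, 56, 48, 24, 50, 50]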

2.2 Dimensionality Reduction for Approximating Time Series

For data streams such as those generated from sensor networks, the volume of data is usually too huge to be stored or to be scanned thoroughly more than once. To ease the difficulties resulting from the typically high dimensionality of the data, many promising solutions involving dimensionality reduction have been proposed. These techniques include the Singular Value Decomposition (SVD), the Discrete Fourier Transform (DFT), the Discrete Wavelet Transform (DWT) [14] and, more recently, the Piecewise Constant Approximation (PCA) [8, 9, 10]. Among the many alternatives, the wavelet transform has been shown to be an effective technique for synopsis generation in the data stream environment due to its low complexity [5]. Moreover, the transformed coefficients tend to be distributed non-uniformly, such that most of the coefficients, i.e., those of small magnitude, can be discarded to save storage space with little sacrifice in the quality of the reconstructed series. In one pass through the data elements, the transformed coefficients can be obtained. Various thresholding policies are then used to discard insignificant coefficients to generate the synopsis. The time complexity of the wavelet transform is O(n), where n is the number of input data elements. The following example uses Haar wavelets, the simplest form of wavelets, to transform the time series values into a series of coefficients for each wavelet basis function.

Example 1: Suppose that there are eight values collected at some moment that form a numerical time series, S = {64, 48, 16, 32, 56, 56, 48, 24}.

To begin the multi-resolution analysis, the values are pairwise averaged to get a lower-resolution signal first. Therefore, we have {56, 24, 56, 36}, where the first two values in the original signal, i.e., 64 and 48, are averaged to 56, the second two values 16 and 32 are averaged to 24, and so on. To avoid losing any information in this averaging process, the difference values, which are 8 (= 64 - 56), -8 (= 16 - 24), 0 (= 56 - 56) and 12 (= 48 - 36), should also be stored. As such, the original values can be reconstructed from these average and difference values.

Table 1. Wavelet-based multi-resolution analysis.

Resolution   Averages                         Differences
8            64, 48, 16, 32, 56, 56, 48, 24   -
4            56, 24, 56, 36                   8, -8, 0, 12
2            40, 46                           16, 10, 8, -8, 0, 12
1            43                               -3, 16, 10, 8, -8, 0, 12

Note that wavelet coefficients corresponding to different resolution scales are generated recursively. The decomposed wavelet coefficients for the original series S are bS = {43, -3, 16, 10, 8, -8, 0, 12}.
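To make the decomposition concrete, here is a minimal Python sketch (ours, not part of the original paper) that reproduces the unnormalized averages and differences of Example 1 and Table 1.

def haar_decompose(values):
    # One full unnormalized Haar multi-resolution pass: returns the overall
    # average followed by the detail coefficients, coarsest level first.
    coeffs = []
    while len(values) > 1:
        averages = [(a + b) / 2 for a, b in zip(values[0::2], values[1::2])]
        details = [(a - b) / 2 for a, b in zip(values[0::2], values[1::2])]
        coeffs = details + coeffs          # coarser details are prepended; finest stay rightmost
        values = averages
    return values + coeffs

print(haar_decompose([64, 48, 16, 32, 56, 56, 48, 24]))
# [43.0, -3.0, 16.0, 10.0, 8.0, -8.0, 0.0, 12.0]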

2.3 Online Updates with Wavelet-Based Approaches

In this paper, the wavelet-based approach is adopted as the basis for storing the variations of the collected data streams. Among all the representations, the error tree structure [12] can be utilized to best clarify the relationship among decomposed wavelet coefficients. For example, the corresponding error tree of Example 1 is shown in Figure 1. Each internal node is associated with a wavelet coefficient, and each leaf node is associated with an original value. Moreover, the value of a leaf node can be obtained by performing additions and subtractions on the values of the internal nodes along the path from the root to that leaf node. Note that, for ease of exposition, the wavelet coefficients are not normalized here. However, these coefficient values are normalized in the algorithm implementation and in the experiments.

Figure 1. Error tree structure for representing the relationship between original values and wavelet coefficients.

It is noted that the original series can be represented by the summation of coefficients at different levels, i.e., levels of resolution. Also note that a time series can be approximated by reconstruction from only the coefficients in the top few levels. This is an efficient way to construct an approximation. On the other hand, for the purpose of reducing the approximation error, the strategy of selecting coefficients to retain, i.e., the thresholding policy, will be discussed in detail in later sections.
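For completeness, the inverse step can be sketched the same way; the following hypothetical helper undoes haar_decompose above and recovers the original series of Example 1 from its coefficients.

def haar_reconstruct(coeffs):
    # Inverse of haar_decompose above: expand the overall average with
    # successive detail levels back into the original series.
    values, details = coeffs[:1], coeffs[1:]
    while details:
        d, details = details[:len(values)], details[len(values):]
        expanded = []
        for avg, diff in zip(values, d):
            expanded += [avg + diff, avg - diff]   # leaf = root average +/- path details
        values = expanded
    return values

print(haar_reconstruct([43, -3, 16, 10, 8, -8, 0, 12]))
# [64, 48, 16, 32, 56, 56, 48, 24]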

To fit the data stream environment, the SWAT structure proposed in [4] can be taken as a modified form of the error tree, which shows the adaptability of this structure to online updates. With the SWAT structure, the storage cost of the synopsis is O(log N), where N is the total number of data points within the observation window. Specifically, SWAT can be viewed as another thresholding policy where only the coefficients located at the right-most positions of each level of the SWAT structure are retained. However, the SWAT structure is used to build synopses biased toward the most recent data, and its storage space increases as the length of the observation window increases. In this paper, we consider the problem of the incremental maintenance of the synopsis of the most recent data elements within the user-specified retrospective period. Our proposed method exploits the error tree structure and the locality property of the wavelet-based synopsis to incrementally maintain the error tree structure by directly performing operations on the wavelet-based synopsis, corresponding to the insertion of newly arriving data and the deletion of the oldest data over consecutive sliding windows.


3 Incremental Maintenance of Synopsis over Sliding Windows

The major features and algorithmic forms of the proposed algorithm DCIF and algorithm DCIR are described in Section 3.1 and Section 3.2, respectively. Section 3.3 presents formulas to track the quality of the maintained wavelet synopsis. The complexities of algorithms DCIF and DCIR are analyzed in Section 3.4. Section 3.5 briefly describes how to manage resources adaptively for multiple data streams.

Problem Definition: Given an update O of size m arriving within a unit time slot, where each element of O is a single real value, and a wavelet-transformed synopsis bS of the current time window of size W ending at the current time t, where the number k of coefficients remaining after synopsis compression, e.g., via thresholding, is usually much smaller than W: maintain the synopsis for the time window ending at time t + m. The maintenance includes inserting the newly arriving data of size m into the synopsis and removing from the synopsis the corresponding oldest data of size m in the original sliding window. The maintained synopsis is used to answer user queries about the content of the most recent sliding window.

3.1 Incremental Maintenance of Error Trees

Due to the limited resource constraints, it is usually not feasible to store all the data points in the current time window. One naive approach is to reconstruct the approximated original data from the synopsis and rebuild the new synopsis from scratch. However, the space and time complexities of this approach are both O(W), where W is the number of elements in a sliding window. For practical window sizes, this approach incurs too much processing overhead and requires a lot of buffer space in a data stream environment such as sensor networks, where energy, computing power and storage space are precious resources. To remedy this, we propose an alternative data structure, SDT, and algorithm DCIF, which exploit the locality property [11] of the wavelet transform and the error tree structure to keep the size of the synopsis constant and to incrementally maintain the content of the synopsis.

The wavelet transform is recognized as a widely used tool that can extract both frequency and time information simultaneously from the input sequence. Unlike other popular transforms such as the Fourier transform, which generates globally averaged frequency information, the wavelet transform extracts frequency features that are localized in the time domain. The basis functions of the Haar wavelet transform represent a multi-resolution decomposition of the input sequence. The set of basis functions of the Haar wavelet transform has locality not only in the frequency domain but also in the time domain. Figure 2 shows an example time and frequency decomposition of the Haar wavelet transform from Example 1. High-frequency components are localized in the time domain [11]. This property can be exploited to localize the update of the error tree.

Figure 2. The time-domain locality of different frequency levels for corresponding nodes on the error tree structure.

Specifically, consider the error tree of the current sliding window at time t. On a level j, the i-th coefficient is localized in the time interval from i * T_j to (i + 1) * T_j, where T_j is the time span at level j, defined as

T_j = W / 2^j,

where W is the length of the sliding window. There are a total of log2 W levels of frequency components, and thus a total of log2 W different lengths of time span, for a given window size W. At the next sliding window, i.e., at time t + m, the most recent time interval of length m is the arriving time interval T+, and the oldest time interval of length m in the previous window is the expiring time interval T-. From the error tree structure, we can identify the coefficients that are localized in the expiring time interval T- and those that are localized in the arriving time interval T+. The former can be deleted from the synopsis while the latter are inserted into the synopsis.

Specifically, given time t, let r = t mod W if t mod W ≠ 0 and r = W otherwise. We check whether ⌊r / T_j⌋ ≠ 0 at level j; if so, there exists an expiring node at this level. At this level, the index i of the expiring error tree coefficient bS(i) in the expiring time interval T- can be derived as

i = W / 2^(L-j) + ⌊r / T_j⌋ - 1,

where L is the total number of levels. The checking of expiring indices is performed iteratively in a levelwise fashion, starting from the finest level j = L - 1 and proceeding toward coarser levels. However, if for a specific level there is no expiring node, i.e., ⌊r / T_j⌋ = 0, the checking is stopped. In addition, if i is found to be 1, i.e., when bS(1) expires, bS(0) automatically expires. We thus proceed to check the next coarser level until we reach a level without any expiring nodes.

Based on the above relation, we extend the error tree structure and propose an alternative data structure called the Sliding Dual Tree (SDT). The wavelet synopsis as represented by the SDT goes through various stages as it is incrementally phased out by a series of DELETE operations. The most recent synopsis is incrementally generated via a series of INSERT and MERGE operations.

Figure 3. DELETE operation at t = 12.

For example, consider the synopsis in Figure 3. At time t = 10, we have ⌊(10 mod 8) / 2⌋ ≠ 0 and

i = 8 / 2^(3-2) + ⌊(10 mod 8) / 2⌋ - 1 = 4 + 1 - 1 = 4 for j = 2,

and we check the next level j = 1, where ⌊(10 mod 8) / 4⌋ = 0, which means all expiring nodes have been found. Therefore, we find that bS(4) expires at this instant. Similarly, at time t = 12, we have ⌊(12 mod 8) / 2⌋ ≠ 0 and

i = 8 / 2^(3-2) + ⌊(12 mod 8) / 2⌋ - 1 = 4 + 2 - 1 = 5 for j = 2,

and thus bS(5) is the expiring node. In addition, we have ⌊(12 mod 8) / 4⌋ ≠ 0 and

i = 8 / 2^(3-1) + ⌊(12 mod 8) / 4⌋ - 1 = 2 + 1 - 1 = 2 for j = 1,

and thus bS(2) is also expiring at this time instant. The nodes at the specified indices are removed from the synopsis at the time instant when they expire. This procedure is called the DELETE operation.
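The levelwise check above can be written down directly. The sketch below is an illustration under the paper's running example (W = 8, L = 3), not the authors' code; iterating down to the coarsest level and skipping already-removed nodes is our assumption, chosen so that the output matches the DELETE decisions of Example 2 and Table 2 later in the paper.

def expiring_indices(r, W, L):
    # Candidate expiring-node indices per the levelwise check,
    # finest level first (assumption: the scan continues down to the coarsest level).
    out = []
    for j in range(L - 1, -1, -1):
        Tj = W // (2 ** j)                # time span covered by one level-j coefficient
        if r // Tj == 0:                  # no node of this level has expired yet: stop
            break
        i = W // (2 ** (L - j)) + r // Tj - 1
        out.append(i)
        if i == 1:                        # when bS(1) expires, bS(0) does too
            out.append(0)
    return out

live = set(range(8))                      # indices of bS(0)..bS(7) still in the synopsis
for t in (10, 12, 14, 16):
    r = t % 8 or 8                        # r = t mod W, with r = W when t mod W == 0
    newly = [i for i in expiring_indices(r, 8, 3) if i in live]
    live -= set(newly)
    print(t, newly)                       # 10 [4] / 12 [5, 2] / 14 [6] / 16 [7, 3, 1, 0]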

Suppose that at t = 12 we have the newly arriving update sequence {50, 50}. By performing the Haar wavelet transform, we obtain the corresponding Haar wavelet coefficient sequence {50, 0}. The resulting error tree structure is included as part of the synopsis, and this procedure is called the INSERT operation.

With the utilization of error tree structures, merging two sub-trees of identical height h to form a bigger error tree of height h + 1 is a trivial task. For example, in order to merge the error tree of the first four values with that of the second four values of the sample series in Example 1, only the coefficients of the top level, i.e., 40 and 46, have to be averaged again to form the new nodes, i.e., 43 and -3, as shown in Figure 4. This procedure is called the MERGE operation.

Figure 4. Merging of two error trees with the same height.
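A possible rendering of the MERGE operation is sketched below (ours, using the unnormalized coefficient layout of the earlier sketches): it computes only the new top average/detail pair and keeps the lower-level coefficients, reproducing the merge of Figure 4.

def merge_error_trees(left, right):
    # Merge two unnormalized Haar error trees of equal height h into one of
    # height h + 1: only a new top average/detail pair is computed; the
    # lower-level coefficients are kept and re-ordered level by level.
    new_avg = (left[0] + right[0]) / 2
    new_detail = left[0] - new_avg          # = (left_avg - right_avg) / 2
    merged = [new_avg, new_detail]
    size, pos = 1, 1                        # walk both trees level by level
    while pos < len(left):
        merged += left[pos:pos + size] + right[pos:pos + size]
        pos += size
        size *= 2
    return merged

print(merge_error_trees([40, 16, 8, -8], [46, 10, 0, 12]))
# [43.0, -3.0, 16, 10, 8, -8, 0, 12]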

In effect, the combination of these three operations facilitates the incremental maintenance of the wavelet synopsis while keeping the size of the required storage constant. The SDT behaves like a set of hierarchical sliding windows with different sliding intervals. Since it is infeasible to calculate the synopsis by storing the whole content of each sliding window, our approach maintains the synopsis incrementally while keeping the maintained synopsis of constant size. The maintained error tree structure varies as new points arrive continuously. Figure 5 and Figure 6 illustrate the SDT and the sequences of operations at different time points. Example 2 shows how the INSERT, MERGE and DELETE operations are used to generate the SDT data structure.

Example 2: Consider Figure 2. Suppose that the synopsis for the current window w = [0, 8) is bS = {43, -3, 16, 10, 8, -8, 0, 12}. Let the update sequence be O = {64, 48, 16, 32, 56, 56, 48, 24}, which repeats the previous sequence here for illustrative purposes. At t = 10, the Haar wavelet coefficients of {64, 48} are bO = {56, 8}, and bS(4) is deleted from bS, as shown in Figure 5(a). At t = 12, the Haar coefficients of {16, 32}, namely {24, -8}, are merged with the previous sub-tree to get the new bO = {40, 16, 8, -8}, and bS(5) and bS(2) are deleted from bS, as shown in Figure 5(b). At t = 14, bO = {56, 0} and bS(6) is deleted, as shown in Figure 6(a). At t = 16, the Haar coefficients of {48, 24}, namely {36, 12}, are merged to get bO = {46, 10, 0, 12} and then further merged to get bO = {43, -3, 16, 10, 8, -8, 0, 12}, and bS(7), bS(3), bS(1) and bS(0) are deleted from bS, as shown in Figure 6(b). The operations are summarized in Table 2. The result obtained at t = 16 is the wavelet synopsis for the time window t = [8, 16), which is identical to that obtained by performing the Haar wavelet transform directly on the update sequence.

Figure 5. Examples of the SDT data structure and the corresponding operations: (a) t = 10, INSERT and DELETE are localized at the bottom level; (b) t = 12, INSERT, DELETE and MERGE propagate the update one level higher.

Table 2. Incremental maintenance operations for Example 2.

t    Update    DELETE                        INSERT    MERGE
10   64, 48    bS(4)                         56, 8     N/A
12   16, 32    bS(5), bS(2)                  24, -8    40, 16, 8, -8
14   56, 56    bS(6)                         56, 0     N/A
16   48, 24    bS(7), bS(3), bS(1), bS(0)    36, 12    46, 10, 0, 12; 43, -3, 16, 10, 8, -8, 0, 12

Figure 6. Examples of the SDT data structure and the corresponding operations: (a) t = 14, INSERT and DELETE are localized at the bottom level; (b) t = 16, INSERT, DELETE and MERGE are used to incrementally build the most recent error tree.

In this paper, we devise Algorithm Direct Cyclic Incremental Forwarding, abbreviated as DCIF, to incrementally maintain the wavelet synopses as represented by the SDT data structure. Specifically, algorithm DCIF is outlined below.

Algorithm DCIF: Direct Cyclic Incremental Forwarding
Input: the error tree of the current time window with current time point t, and an update O of size m.
1. Obtain the Haar wavelet error tree bO of the update O
2. DELETE the expired nodes at the finest level
3. INSERT bO into the synopsis
4. while the next level has an expiring node
5.   DELETE the expiring node
6. end of while
7. while the size of the most recent error sub-tree equals the size of the next most recent sub-tree
8.   MERGE the two sub-trees
9. end of while
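Putting the three operations together, the following hedged sketch runs one DCIF-style maintenance step per update. It reuses the haar_decompose, expiring_indices and merge_error_trees helpers from the earlier sketches; the names and the data layout (a dict of surviving old coefficients plus a list of new sub-trees) are ours, not the paper's. On the running example it reproduces the MERGE column of Table 2.

def dcif_step(old_live, subtrees, update, t, W, L):
    # 1. Haar-transform the update into a small error sub-tree
    new_tree = haar_decompose(update)
    # 2. DELETE the expired coefficients of the old window's tree
    r = t % W or W
    for i in expiring_indices(r, W, L):
        old_live.pop(i, None)             # re-deleting an already removed node is a no-op
    # 3. INSERT the new sub-tree
    subtrees.append(new_tree)
    # 4. MERGE equal-sized neighbouring sub-trees
    while len(subtrees) >= 2 and len(subtrees[-1]) == len(subtrees[-2]):
        right, left = subtrees.pop(), subtrees.pop()
        subtrees.append(merge_error_trees(left, right))
    return old_live, subtrees

old = dict(enumerate([43, -3, 16, 10, 8, -8, 0, 12]))   # surviving coefficients of the old window
trees = []                                              # error sub-trees built from new arrivals
stream = [64, 48, 16, 32, 56, 56, 48, 24]
for step, t in enumerate((10, 12, 14, 16)):
    old, trees = dcif_step(old, trees, stream[2 * step:2 * step + 2], t, 8, 3)
print(trees[0])   # [43.0, -3.0, 16.0, 10.0, 8.0, -8.0, 0.0, 12.0], as in Example 2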

3.2 Temporal Query Processing

Our SDT scheme provides an efficient infrastructure for temporal query processing. The synopses in the SDT structure are incrementally maintained by DCIF in the time-frequency domain of the Haar wavelet. As an inverse of DCIF, algorithm Direct Cyclic Incremental Rewinding, abbreviated as DCIR, is performed on the maintained wavelet synopses to reconstruct the content of the current sliding window. Specifically, algorithm DCIR is outlined below.

Algorithm DCIR: Direct Cyclic Incremental Rewinding
Input: the collection of maintained error trees, the current time point t and the sliding window length W.
1. while there is any unprocessed error tree
2.   Find the most recent unprocessed error tree
3.   Reconstruct from it to obtain its data elements
4.   Append them to the front of the reconstructed sequence
5. end of while
6. return the most recent W elements of the reconstructed sequence
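A corresponding DCIR-style reconstruction can be sketched as follows; again this is an illustration (reusing haar_reconstruct from the earlier sketch, with deleted coefficients treated as zeros), and it reproduces the query answer derived in Example 3 below.

def dcir(old_live, subtrees, W):
    # Rebuild the most recent W values from the maintained synopsis.
    recent = []
    for tree in subtrees:                       # sub-trees of new arrivals, oldest first
        recent += haar_reconstruct(tree)
    old = [old_live.get(i, 0) for i in range(W)]
    tail = haar_reconstruct(old)[len(recent):]  # part of the old window still in scope
    return (tail + recent)[-W:]

old = {0: 43, 1: -3, 3: 10, 6: 0, 7: 12}        # bS(2), bS(4), bS(5) already deleted at t = 12
print(dcir(old, [[40, 16, 8, -8]], 8))          # [56, 56, 48, 24, 64, 48, 16, 32]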

Example 3: Continuing from Example 2, suppose that at time point t = 12 the following query is to be processed.

Q (historical query): "What were the last W readings of temperature?"

Suppose that W = 8. The synopsis at time point t = 12 consists of cS1 = {43, -3, X, 10, X, X, 0, 12}, where X denotes the deleted nodes, and cS2 = {40, 16, 8, -8}. To answer the query, algorithm DCIR proceeds to reconstruct the contents of the sliding window w = [4, 12). Algorithm DCIR first reconstructs from cS2 and obtains {64, 48, 16, 32}. Next, the deleted nodes are treated as zeros, and the sequence reconstructed from cS1 is thus {40, 40, 40, 40, 56, 56, 48, 24}. The first four elements are discarded since they fall out of the scope of the query, and thus the answer to Q is {56, 56, 48, 24, 64, 48, 16, 32}.

3.3 Tracking the Quality of Synopses

Generally speaking, there exists a trade-off between the quality of the reconstructed time series and the storage space for the retained coefficients. Time series generated by continuous random variables require an infinite number of bits to represent, and quantization is thus necessary for a practical finite representation. However, since quantization introduces error, we need to find the best trade-off between the required storage space and the reconstruction quality. For wavelet synopses, while we can omit insignificant values to meet storage space constraints, it is important to understand that, for the purpose of approximation, the quality of the synopses is in proportion to the number of significant coefficients stored.

The goal of thresholding is to determine the coefficients to keep so as to minimize the error of approximation. A straightforward thresholding policy, i.e., hard thresholding, is to set all wavelet coefficients less than a fixed constant to zero. Provided that the transformation is orthonormal, this selection process is optimal in minimizing the absolute error [5]. The approximation quality of the synopsis is commonly measured in terms of the L2 norm (or the Euclidean distance), which reflects the average sequence distance between two series and is defined as follows:

L2(X, Y) = sqrt( (1/n) * Σ_{i=1}^{n} (x_i - y_i)² ),

where x_i and y_i, for i = 1, 2, ..., n, are the i-th values of the two time series X and Y, respectively.

Top-k thresholding techniques are often used to reduce the size of the synopsis from W, the number of data points in a time window, to k. A small number k usually suffices to yield reasonably good quality. Based on the energy preservation property of the wavelet transform, the commonly used L2-norm error for measuring the quality of the maintained synopses can be calculated when discarding insignificant coefficients. Note that selecting the top k coefficients with the largest absolute values has been recognized as the optimal thresholding strategy for reducing the L2-norm error [11].

Let e_i denote the sum of squared differences between the i-th insertion of size m, O_i, and its corresponding wavelet synopsis O'_i. Let W be the sliding window size. Let S and S' be, respectively, the original data and the wavelet synopsis after the insertions, where i = 1, 2, ..., W/m. Then the accumulated squared L2 error of S', denoted by E', is given by the following theorem.

Theorem 1: We can keep track of the approximation error of the incrementally generated wavelet synopsis by

E' = Σ_{i=1}^{W/m} e_i.

Proof: This theorem follows directly from the definitions. The total squared error over the window decomposes by insertion:

E' = Σ_j (S(j) - S'(j))² = Σ_{i=1}^{W/m} Σ_j (o_{ij} - o'_{ij})² = Σ_{i=1}^{W/m} e_i,

where S(j) and S'(j) denote the j-th value of the window content and of its reconstruction from the synopsis, and o_{ij}, o'_{ij} are the corresponding values restricted to the i-th insertion. Q.E.D.

Example 4. Consider the incremental maintenance operations of Table 2 in Example 2. Suppose that the top-k thresholding policy is used to reduce the storage size of the wavelet synopses with k = 4, i.e., the top 4 coefficients are kept and the less significant ones are discarded. The incremental maintenance operations and the discarded coefficients at each time point are shown in Table 3. At t = 16, the error is found by Theorem 1 as

E' = Σ e_i = 3² + 8² + 8² = 137.


Table 3. Incremental maintenance operations for Example 4.

t    DELETE                        MERGE                                        Discarded
10   bS(4)                         N/A                                          8
12   bS(5), bS(2)                  40, 16, 0, 0                                 -8
14   bS(6)                         N/A                                          none
16   bS(7), bS(3), bS(1), bS(0)    46, 10, 0, 12; 43, 0, 16, 10, 0, 0, 0, 12    -3

3.4 Complexity Analysis

Since the online processing of updates should minimize the time and space overhead of the incremental maintenance of the synopsis, algorithm DCIF does not try to obtain the exact synopsis as calculated off-line. Instead, the synopsis is maintained incrementally by deletion and insertion. By exploiting the locality property of the wavelet transform, DCIF is an effective and efficient technique to process the continuously arriving data and to match the computing speed with the arrival rate of the data. The time and space complexities of algorithms DCIF and DCIR are analyzed in this section.

Algorithm DCIF involves the INSERT, DELETE and MERGE operations. The time complexity of these operations and of the overall algorithm is described in the following theorems.

Theorem 2: The time complexity of algorithm DCIF is O(log2 W), where W is the size of the sliding window.

Proof: Suppose the length of the update sequence is m, where m is at most W. The deletion of expiring coefficients is done in O(1) time, while the generation and insertion of new coefficients are done in O(m) time. Merging two error trees is done in O(1) time and there are at most O(log2 W) merges for each update. The time complexity of merging subsumes those of insertion and deletion. Therefore the overall time complexity is O(log2 W). Q.E.D.

Theorem 3: The space complexity of algorithm DCIF is O(k), i.e., constant with respect to the window size, where k is the number of coefficients preserved after thresholding.

Proof: Since algorithm DCIF deletes as many expiring coefficients as it inserts new ones, the space remains constant and equals the number of non-zero coefficients preserved in the synopsis after thresholding, k. Q.E.D.

A query over the maintained synopsis performs the inverse of DCIF to rewind the synopsis, applying the inverse Haar wavelet transform to reconstruct the time series in the most recent sliding window. The time and space complexities of algorithm DCIR are analyzed as follows.

Theorem 4: The time and the space complexities of algorithm DCIR are both O(W), where W is the size of the sliding window.

Proof: The time and space complexities of the inverse Haar wavelet transform are O(n), where n is the number of coefficients. In addition, the number of coefficients in the maintained synopsis is O(W), where W is the size of the sliding window, so the time and space complexities of the reconstruction are both O(W). Therefore the overall time and space complexities of algorithm DCIR are O(W). Q.E.D.

3.5 Adaptive Resource Management

For the purpose of tracking in a data stream environment with our wavelet-based algorithm DCIF, the retrospective period W and the top-k thresholding policy for a single sensor node can be adaptively adjusted to achieve effective resource management. For example, when the maintained synopses fail to satisfy the requirement of an incoming user query, the sensor node can re-initiate the tracking with an increased retrospective period. Moreover, if the quality of the synopses as maintained by DCIF falls below the default system requirements, the thresholding policy is relaxed such that more storage units are allocated for future synopses. In addition, data streams might have different arrival rates; streams which are updated less frequently are allocated less space.

As mentioned above, these various approaches for performing adaptive resource management with algorithm DCIF improve the utilization of the available resources and the quality of the synopses on a single sensor node. For the overall sensor network, for example at a central server where the synopses of multiple time series are maintained, the value of k can be a constant for every synopsis, which results in a constant compression ratio for each time series. The major problem is that the overall storage space is in proportion to the number of time series, and thus may increase as time advances. Therefore, the values of k for the individual time series are adaptively decided with the policy that all reconstructed series have about the same energy preservation ratio. Algorithm DCIF can also be used to guarantee that the amount of memory for storage is constant over time and to maximize the quality of the maintained synopses at the central site where more resources are available.

4 Empirical Studies

The simulation model of our experimental studies is described in Section 4.1. To assess the performance of the proposed algorithm DCIF, we conduct experiments on both synthetic and real datasets. The execution efficiency of algorithm DCIF is shown in Section 4.2. The quality of the wavelet-based synopses maintained by algorithm DCIF and by algorithm SWAT, measured by L2-norm errors, is shown in Section 4.3 for comparison purposes.


Figure 7. Real and synthetic time series for testing: (a) temperature variations (F) of two cities, Kuala Lumpur and Moscow, over time (days); (b) two random walk series.

4.1 Simulation Model

In our experiments, real datasets are used as our testbed. The real dataset is obtained from the Average Daily Temperature Archive of the University of Dayton [1], in which the average daily temperatures of 290 cities around the world are recorded as numerical time series. The daily average temperatures of each city are recorded from January 1, 1995 to the present. To conduct experiments on time series with different distributions, the temperature variations of two cities, Kuala Lumpur and Moscow, are observed. The length of both series is 1,024 points, with each point being a daily measurement, i.e., the observation period is about three years. As shown in Figure 7(a), the temperature in Kuala Lumpur is nearly constant while a seasonal periodicity exists in that of Moscow.

The synthetic datasets used are random walk series. It is generally recognized that many real applications generate data in the form of random walks, e.g., variations of stock prices. By specifying an initial value of 1,000 and a step length of 10, two generated series are obtained, as shown in Figure 7(b). Note that the length of both synthetic series is also 1,024 points.

Figure 8. Execution time comparison (accumulated execution time, in milliseconds, of the naive approach and of DCIF).

4.2 Execution Time

Figure 8 shows the accumulated execution time of algorithm DCIF and of the naive approach. In the naive approach, the wavelet synopsis is first reconstructed to obtain the original window, the window is shifted to include the update, and the shifted window is used to generate the new synopsis. Note that algorithm DCIF finishes the maintenance operation almost instantaneously on each update, while the running time of the naive approach grows roughly linearly with the number of data points, showing the computational efficiency of algorithm DCIF for data stream applications.

4.3 Quality of Synopsis with Fixed Storage Space

Figure 9 shows the L2 error, i.e., the difference between the sequence reconstructed from the compressed synopsis and the original data sequence, at different time points. Note that the storage cost limits the upper bound of the top-k thresholding: the storage cost of the synopses maintained by DCIF remains constant while that of SWAT is O(log N), where N is the length of the observation window. The window size is chosen to be 128, and thus the SWAT structure takes 3 * log2 128 = 21 storage units. The k value of the top-k thresholding for the DCIF structure is also fixed at this value for comparison purposes. We found that the quality of the synopsis maintained by DCIF is better than that maintained in the SWAT structure for both the real and the synthetic time series. In addition, DCIF effectively performs incremental maintenance of the wavelet-based synopsis over a set of sliding windows and enables the tracking of the variation in quality of the maintained synopses.


Figure 9. Comparison of synopsis quality (L2 error over time points) on different datasets using algorithms SWAT and DCIF: (a) Kuala Lumpur, (b) Moscow, (c) RandomWalk1, (d) RandomWalk2. Note that k = 21 for all tests.

Algorithm DCIF not only incrementally maintains the wavelet synopses but also incrementally updates the current error due to thresholding, so that there is no need to reconstruct from the wavelet synopses and go through the entire window to calculate the L2 norm. We execute algorithm DCIF with different k values for the top-k thresholding, i.e., different numbers of available storage units and of coefficients kept in the wavelet synopses. The quality of the wavelet synopses maintained by DCIF and SWAT is shown in Figure 10 for the Kuala Lumpur series. This agrees with our intuition, since the top-k thresholding controls the error of the synopses and a wavelet-transform synopsis needs only a few coefficients to achieve reasonable quality.

Figure 10. Quality (in terms of average L2 errors) of wavelet synopses for different values of k; series shown: DCIF with W = 256, W = 128 and W = 64.

Note that the retrospective period, which equals the window length W, determines the number of historical data points covered by the synopsis. To make effective use of the limited storage space, we would like to maximize the window length in order to maximize the information captured by the wavelet synopsis.

Results showing the quality of the synopses for different window lengths with the number of storage units fixed are provided in Table 4. It is observed that the synopsis maintained via the SWAT structure with window length 64, i.e., with 18 required storage units, is not capable of answering queries beyond its window. However, the synopsis maintained by algorithm DCIF with the same number of storage units is able to cover a window of twice the length with better quality. Moreover, with a slight increase of about 10% in the average L2-norm error, the retrospective period can be extended to four times the length, i.e., W = 256. For a time window of that size, the SWAT structure requires 24 storage units to reach an acceptable error level. In other words, given fixed storage constraints, algorithm DCIF shows better storage utilization and effectively increases the retrospective period of the wavelet synopses. Consequently, algorithm DCIF provides synopses with better quality for further temporal query processing.

Table 4. Results of synopsis quality when varying the window size for algorithms SWAT and DCIF.

Algorithm   W     k    Average L2 Error
SWAT        64    18   1.32
DCIF        128   18   1.23
DCIF        256   18   1.46
SWAT        256   24   1.63


5 Conclusions

In this paper, we have presented techniques to incrementally maintain wavelet-based synopses for applications in a data stream environment. Our approach directly maintains data synopses over consecutive sliding windows without the need to store or to reconstruct the detailed contents. Specifically, algorithm DCIF incrementally maintains the proposed SDT structure by exploiting the locality and the multi-resolution properties of the wavelet transform, while algorithm DCIR reconstructs the contents from the maintained synopses to answer user queries. In addition, the quality of the maintained synopses in terms of the L2 norm is also incrementally tracked. The advantageous features of the proposed approaches are verified through extensive experimental studies in which the algorithm SWAT, proposed in an earlier work, is also implemented for comparison purposes. Generally speaking, the approaches developed in this paper can effectively and efficiently work along with the common thresholding policy in an incremental fashion to maintain the synopses with the limited storage space required by a data stream environment.

References

[1] Average Daily Temperature Archive of the University of Dayton. http://www.engr.udayton.edu/weather/.

[2] A. Arasu and G. S. Manku. Approximate counts and quantiles over sliding windows. In Proceedings of the Twenty-third ACM Symposium on Principles of Database Systems, pages 286-296, June 2004.

[3] B. Babcock, M. Datar, R. Motwani, and L. O'Callaghan. Maintaining variance and k-medians over data stream windows. In Proceedings of the Twenty-second ACM Symposium on Principles of Database Systems, pages 234-243, June 2003.

[4] A. Bulut and A. K. Singh. SWAT: Hierarchical stream summarization in large networks. In Proceedings of the 19th International Conference on Data Engineering, pages 303-314, March 2003.

[5] C. S. Burrus and R. A. Gopinath. Introduction to Wavelets and Wavelet Transforms. Prentice Hall, 1997.

[6] M. Datar, A. Gionis, P. Indyk, and R. Motwani. Maintaining stream statistics over sliding windows (extended abstract). In Proceedings of the Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 635-644. Society for Industrial and Applied Mathematics, January 2002.

[7] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Surfing wavelets on streams: One-pass summaries for approximate aggregate queries. In Proceedings of the 27th International Conference on Very Large Data Bases (VLDB 2001), pages 79-88, September 2001.

[8] E. Keogh, K. Chakrabarti, S. Mehrotra, and M. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data, pages 151-162, May 2001.

[9] E. Keogh, K. Chakrabarti, M. Pazzani, and S. Mehrotra. Dimensionality reduction for fast similarity search in large time series databases. Knowledge and Information Systems, 3(3):263-286, August 2001.

[10] I. Lazaridis and S. Mehrotra. Capturing sensor-generated time series with quality guarantees. In Proceedings of the 19th International Conference on Data Engineering, pages 429-440, March 2003.

[11] T. Li, Q. Li, S. Zhu, and M. Ogihara. A survey on wavelet applications in data mining. SIGKDD Explorations, pages 49-68, December 2002.

[12] Y. Matias, J. S. Vitter, and M. Wang. Wavelet-based histograms for selectivity estimation. In Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data, pages 448-459, June 1998.

[13] Y. Matias, J. S. Vitter, and M. Wang. Dynamic maintenance of wavelet-based histograms. In Proceedings of the 26th International Conference on Very Large Data Bases (VLDB 2000), pages 101-110, September 2000.

[14] I. Popivanov and R. J. Miller. Similarity search over time series data using wavelets. In Proceedings of the 18th International Conference on Data Engineering, pages 212-221, February 2002.

[15] W. G. Teng, M. S. Chen, and P. S. Yu. A regression-based temporal pattern mining scheme for data streams. In Proceedings of the 29th International Conference on Very Large Data Bases, pages 93-104, September 2003.

[16] B. K. Yi and C. Faloutsos. Fast time sequence indexing for arbitrary Lp norms. In Proceedings of the 26th International Conference on Very Large Data Bases, pages 385-394, September 2000.


Finding Temporal Association Rules between Frequent Patterns in Multivariate Time Series

Giridhar Tatavarty
ECECS Department, University of Cincinnati
Cincinnati, OH 45221, USA
Email: [email protected]

Raj Bhatnagar
ECECS Department, University of Cincinnati
Cincinnati, OH 45221, USA
Email: [email protected]

Abstract— We consider the problem of mining multivariate time series data for discovering (i) frequently occurring substring patterns in a dimension, (ii) temporal associations among these substring patterns within or across different dimensions, and (iii) large intervals that sustain a particular mode of operation. These represent patterns at three different levels of abstraction for a dataset having very fine granularity. Discovery of such temporal associations in a multivariate setting provides useful insights, resulting in a prediction and diagnostic capability for the domain. In this paper we present a methodology for efficiently discovering all frequent patterns in each dimension of the data using Suffix Trees; then clustering these substring patterns to construct equivalence classes of similar (approximately matching) patterns; and then searching for temporal dependencies among these equivalence classes using an efficient search algorithm. Modes of operation are then inferred as a summarization of these temporal dependencies. Our method is generalizable, scalable, and can be adapted to provide robustness against noise, shifting, and scaling factors.

Multi-attribute time series data occurs in various domains such as finance, science and engineering applications, weather and network traffic monitoring. These datasets contain repeated occurrences of some patterns with minor variability among different occurrences. Discovery of such repeating patterns provides insights into the domain's underlying process. Important tasks in mining time series data address the issues of similarity search among different patterns, discovery of frequently occurring patterns and the task of finding temporal associations among groups of frequently occurring patterns. Consider the following simple example. Weather monitoring data for a year may contain a scattered but repeating pattern of temperature going up for three continuous days. It may also contain another repeating pattern of temperature falling for three continuous days. Other repeating patterns may be for buildup in pressure for a few days in a row and some particular variation pattern for humidity for a few days in a row. Our first objective is to discover these repeating patterns in each dimension (such as pressure, temperature, humidity). The second major objective is to discover temporal dependencies among the patterns occurring either within the same dimension or across dimensions. As an example of the latter, we may be interested in discovering a dependency such as: the rising temperature pattern is followed by a rising pressure pattern and it also overlaps with a particular buildup pattern in humidity. As a final summarization, we may want to identify, say, the two 45-day long periods during which this temporal association is observed. In this paper we demonstrate achieving all of these objectives.

Techniques for association rule mining [6], [3] with sequence data, frequent episode mining [5], temporal rule mining [2], and interval data mining [1], [4], [13] all solve only parts of the above problem. These techniques can only find the temporal relationship of follows and are designed to work on either market basket data or interval data rather than multivariate time series data, which is significantly different in character. In this paper we demonstrate discovery of the temporal relations contains and overlaps in addition to follows. We use clusters of similar temporal substrings as equivalence classes to conquer the computational complexity resulting from the explosion of different types and sizes of substring patterns in time series data.

I. RELATED WORK

Much work has been done in finding similar patterns in time series data, indexing the time series and mining them for repeating trends or patterns. Most such methods apply techniques [24], [25], [26], [27], [28] such as the Fourier and Wavelet Transforms, Dynamic Time Warping, Piecewise Approximation, and Shape and Structural Grammars [22], [23]. Various similarity measures [9] have been proposed to compare patterns in time series from different perspectives.

The problem of finding recurring substring patterns [12], [8], [13] has been solved using a number of approaches for periodic, non-periodic and partially periodic patterns. The work described in [8] efficiently uses suffix trees to mine frequent patterns, but the number of patterns invariably becomes very large. The approach in [12] formalizes a repeating pattern as a motif and presents the EMMA algorithm to find k-motifs. Finding temporal dependencies between frequently occurring patterns is a challenging and useful task. For example, it is more informative to know: If Stock A increases during a period and Stock B decreases during an overlapping period then Stock C increases in a following time interval than just knowing: Stock A, Stock B and Stock C increase, decrease and increase in frequently recurring patterns. This kind of temporal association helps to establish correlations between patterns of underlying phenomena and enables us to develop predictive and diagnostic tools for the domain.


Some of the issues from multi-dimensional time series association rule mining can be translated into inter-transactional association rule mining [6], but fundamentally the two problems are quite different and require different approaches. The pattern in a multivariate time series is a substring from a single dimension along the time axis. In inter-transactional association rule mining there is no clear concept of each dimension, and data is available in the form of a set of temporal sequences. The work in [5] introduces the concept of episodes and the WINEPI algorithm to mine different types of frequent episodes on event sequences. This algorithm is applicable to event sequence data only and not to multidimensional time series data. The algorithm in [7] mines the event data for patterns called m-patterns which are mutually dependent. Rule discovery from time series by Das et al. in [2] finds temporal rules in multi-dimensional time series data. Discovery of temporal rules such as "A followed by B" is done on basic shapes of the time series patterns, which are constrained by the window size on which these basic shapes are constructed. In this method the window size would dictate the shape of the patterns discovered and also the temporal associations among them. The temporal rules discovered are only of the type "A followed by B" and do not consider many other possible [10] temporal dependencies such as those shown in Figure 2. The research presented in [11] discovers temporal containment in event time series where events are intervals rather than time points. Work described in [18] finds sequential patterns of the type "A followed by B" where A and B are subsequences of the time series database. The work in [18] uses suffix tree patterns for all rules. The work in [1] extends temporal association rule mining to interval-based events and gives an algorithm for finding temporal dependencies. We consider more general types of temporal rules which include a subset of Allen's [10] possible temporal relationships among events.

Contribution: Our primary contribution is an efficient methodology to discover temporal associations among patterns in a multi-dimensional time series dataset. It can tolerate significant noise levels in data, supports patterns of different lengths within one equivalence class, finds temporal association rules among the equivalence classes of string patterns and summarizes them. Our methodology goes beyond the ideas presented in [2] in two different ways: (i) it can find temporal associations between patterns of different length without the requirement of knowing the window size of the pattern, and (ii) it explores many different types of temporal associations among patterns.

Paper Organization: The rest of this paper is organized as follows. In Section II we define the formal problem. In Section III we discuss the process of quantization of the time series data to convert it to symbol strings. Section IV describes our space-efficient suffix-tree based search for frequent patterns in each dimension of the time series data. Section V presents clustering of frequently occurring substrings into a much smaller number of equivalence classes. Section VI presents the algorithm for discovering temporal dependencies between the equivalence classes of substrings. Section VII presents the higher-level summarization of the temporal dependencies discovered in Section VI. Finally, Section VIII discusses the experimental results, followed by the conclusions in Section IX.

Fig. 1. Constraints for Temporal Relation between two position pairs X, Y.

II. PROBLEM DEFINITION

In this section we introduce the basic notions required to define and formulate the problem.

Definition 1: A time series X is a finite sequence of real values (x1, x2, x3, ..., xn). A multi-dimensional time series in an m-dimensional space, with n observations in each dimension, is represented by m sequences:
X0 = (x00, x01, x02, ..., x0(n−1))
X1 = (x10, x11, x12, ..., x1(n−1))
...
X(m−1) = (x(m−1)0, x(m−1)1, x(m−1)2, ..., x(m−1)(n−1)).
(x0j, x1j, x2j, x3j, ..., x(m−1)j), where j = 0, ..., n−1, is called the observation column, and all the values in the column have the same time stamp.

Our symbolic computation framework requires that the time series be discretized and converted into a symbolic representation using a finite alphabet set Σ. The discretization levels may be different for each dimension of the time series data. Let Quant(Xi, xij, ki) be a discretization function which maps a real value xij into a positive integer value l, 0 ≤ l ≤ ki − 1. The maximum number of levels ki into which the values of a dimension Xi are discretized can be different for each dimension. Let Σi be an ordered alphabet set which contains symbols (a0, a1, a2, ..., a|Σi|−1), where |Σi| is the size of the alphabet set Σi.

Definition 2: The symbolic representation of a time series Xi is a mapping of the real values of Xi into a quantized symbolic string Si = si0 si1 si2 ... si(n−1) such that sij = ail, where l = Quant(Xi, xij, ki), i.e., ail is the l-th symbol of Σi.

Definition 3: A pattern p of length w < n is a contiguous substring (sji, sj(i+1), ..., sj(i+w−1)) of a string Sj, where i + w − 1 ≤ n − 1 and 1 ≤ i ≤ n − w + 1.

A pattern p is called a frequent pattern if the number of occurrences of p in Si ≥ MIN_SUPPORT. In the following discussion we use Pattern to refer to a frequently occurring substring.

Definition 4: Let C = {p0, p1, p2, ..., pmax}, 1 ≤ max, be the set of all the patterns (substrings) from a sequence S. Clustering is a partitioning of C into subsets C0, C1, ..., Cm′−1 such that:
1. Cz ≠ ∅, z = 0, ..., m′ − 1
2. ⋃_{z=0}^{m′−1} Cz = C
3. Cy ⋂ Cz = ∅; y ≠ z; y, z = 0, 1, ..., m′ − 1

Partitioning of the set C is done by a clustering algorithm (V) which groups all the similar patterns into an equivalence class and treats each class as a single pattern recurring with some noise. The number of possible patterns generated by a dataset typically explodes exponentially, and thereby the computational complexity of the mining process becomes exponential. The clustering step significantly helps in controlling this complexity. By controlling the number of equivalence classes permitted and the characteristics of the clusters generated, we can control the nature of the patterns and their temporal dependencies that are discovered. Each equivalence class may contain patterns that differ due to symbol substitution and also in the length of the substring.

Definition 5: A Position pair is an ordered pair (ps, pe), where ps ≤ pe and ps, pe are both positive integers. A position pair is used to represent the location and duration of a single occurrence of a pattern in a given sequence. The number ps, called the starting position, denotes the starting time of the pattern p in Si, and pe, called the ending position, denotes the ending time of the pattern p in Si. For example, pattern 'abc' has a position pair (3,5) in the sequence 'efabcdef'. Two position pairs A, B such that A = (as, ae), B = (bs, be) and as ≤ bs overlap each other iff bs ≤ ae. Two overlapping position pairs A, B are said to be merged to form a single position pair B′ such that B′ = (as, max(ae, be)). For example, in the sequence "ababa" the position pairs for "aba", (0,2) and (2,4), overlap, and the merged position pair for (0,2), (2,4) is given by (0,4).

We now define a list of positions at which different occurrences of a pattern appear in a sequence Si. A cluster position list is then created by merging the position lists of all the patterns that belong to the same equivalence class.

Definition 6: A Position List L = (as1, ae1), (as2, ae2), ..., (ask, aek) is an ordered list of position pairs sorted by starting positions, such that any two overlapping position pairs are merged until no overlapping positions are left. A Position List has two important properties:
1. All the position pairs are sorted by their starting position, i.e., as1 < as2 < ... < ask.
2. No two position pairs overlap, i.e., ae1 < as2, ae2 < as3, ..., ae(k−1) < ask.

A Position list for a pattern p in a sequence Si is the orderedlist of position pairs for all occurrences of p. This is obtainedby merging all overlapping occurrences (position pairs) of p inSi until no overlapping position pairs are found. For example,position list for pattern ’aba’ in sequence ’ababacdaaba’ isgiven by (0,4),(8,10), here the position pairs (0,2),(2,4)are merged to (0,4). The idea of position list is extended toequivalence classes also. Since each class is a set of patterns, aposition list for a cluster is defined as the union of all positionlists corresponding to the patterns contained in the cluster

Fig. 2. Temporal relations between two position pairs a, b and position lists A, B

such that overlapping positions are repeatedly merged to form a non-overlapping, sorted position-pair list. The cluster position list is, therefore, the list of all locations at which equivalent patterns occur, possibly with some noise, in a sequence. The cluster position list is not unique for a dataset and depends upon the clustering algorithm, quantization, and similarity measures used for clustering. Conceptually, by finding cluster position lists we are constructing an interval database from a sequence database and then finding temporal associations on the interval database using a technique similar to [1]. The concept of a position pair is the same as that of an event in an event interval database, while a position list is the same as an event sequence.
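To make the merging in Definitions 5 and 6 concrete, here is a minimal Python sketch; the helper name is ours, not the paper's.

def merge_position_pairs(pairs):
    """Merge overlapping (start, end) pairs into a sorted, non-overlapping position list."""
    merged = []
    for ps, pe in sorted(pairs):                 # sort by starting position
        if merged and ps <= merged[-1][1]:       # overlaps the previous pair
            merged[-1] = (merged[-1][0], max(merged[-1][1], pe))
        else:
            merged.append((ps, pe))
    return merged

# occurrences of 'aba' in 'ababacdaaba':
# merge_position_pairs([(0, 2), (2, 4), (8, 10)])  ->  [(0, 4), (8, 10)]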

Definition 7: A temporal relation Rel ∈ {overlaps, followedby, contains} is defined for position pairs and position lists as follows.

Let A = (as, ae), B = (bs, be), where as ≤ bs, be two position pairs and let WIN be a user-defined window size; then the relation Rel(A,B), Rel ∈ {overlaps, followedby, contains}, is defined as:

followedby(A,B) is true ⟺ (ae < bs) ∧ (bs − ae ≤ WIN). Also (A followedby B) is a new position pair given by (as, be).

contains(A,B) is true ⟺ (as < bs) ∧ (be < ae). Also (A contains B) is a new position pair given by (as, ae).

overlaps(A,B) is true ⟺ (bs ≤ ae) ∧ (ae ≤ be). Also (A overlaps B) is a new position pair given by (as, be). This definition of overlap subsumes Allen's [10] definitions of meets, starts, equals and overlaps. Combining the meaning of meets, starts, equals and overlaps into a single definition of overlap significantly reduces the number of temporal association rules in the search space. See figure 2 for examples of the temporal relations overlaps, contains and followedby between two position pairs A and B. For example, with position pairs A = (1,12), B = (15,24) and window WIN = 5, we can see that followedby(A,B) is true and the new position pair (A followedby B) = (1,24) is created. Similar position pairs are created for the other relations, overlaps and contains.


The concept of temporal relations is extended from position pairs to position lists. Temporal relations between position lists are defined in terms of position pairs, i.e. if a certain number of position pairs in a list satisfy the temporal relationship then the position list is said to satisfy the temporal relationship. Specifically, let a be any position pair, L be a position list and Rel ∈ {overlaps, followedby, contains}. Then a mapping function δ(a, L, Rel) is defined as:

δ(a, L, Rel) = 1 if ∃ position pair b ∈ L such that Rel(a, b) = true; else δ(a, L, Rel) = 0.

Given a user-defined minimum support, the temporal relation Rel for two position lists L1, L2 is defined as: Rel(L1, L2) is true iff

Σ_{ai ∈ L1} δ(ai, L2, Rel) ≥ min_support,

else (L1 Rel L2) is false. Also a new position list is created as

(L1 Rel L2) = ⋃ (a Rel b), where a ∈ L1, b ∈ L2 and Rel(a, b) is true.

In other words, the temporal relation Rel(A, B) for position lists A, B is said to be true if the number of position pairs in A which satisfy Rel(a, b), a ∈ A, b ∈ B, is greater than the minimum support, and the corresponding position list (A Rel B) is constructed as the union of all the position pairs which satisfy Rel(a, b). Again, the position pairs are merged during the union such that the properties of a position list are satisfied.
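A hedged sketch of Definition 7 and its lift to position lists, assuming the (start, end) tuple representation above. All function names are illustrative, and the example lists correspond to the worked example in Table I below.

from functools import partial

def followedby(a, b, win):
    (as_, ae), (bs, be) = a, b
    return ae < bs and bs - ae <= win

def contains(a, b):
    (as_, ae), (bs, be) = a, b
    return as_ < bs and be < ae

def overlaps(a, b):
    (as_, ae), (bs, be) = a, b
    return bs <= ae and ae <= be

def combine(a, b, rel):
    # the new position pair produced when rel(A, B) holds
    return (a[0], a[1]) if rel is contains else (a[0], b[1])

def delta(a, L, rel):
    # δ(a, L, Rel): 1 if some pair b in L satisfies Rel(a, b), else 0
    return int(any(rel(a, b) for b in L))

def list_relation(L1, L2, rel, min_support):
    # (L1 Rel L2): holds if enough pairs of L1 participate; the new list is the union
    if sum(delta(a, L2, rel) for a in L1) < min_support:
        return False, []
    pairs = sorted(combine(a, b, rel) for a in L1 for b in L2 if rel(a, b))
    return True, pairs   # overlapping pairs should then be merged as in the previous sketch

# Example (cf. Table I): C1 followedby C3 with WIN = 3 and min_support = 2
C1 = [(0, 2), (7, 9), (13, 14)]
C3 = [(3, 4), (11, 12), (16, 17)]
print(list_relation(C1, C3, partial(followedby, win=3), min_support=2))
# -> (True, [(0, 4), (7, 12), (13, 17)])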

Definition 8: A temporal association rule between two position lists is recursively defined as follows. This definition is based on [1].

1. If X is a position list then it is a temporal association rule. It is also called an atomic temporal association rule.

2. If Rel(X, Y) is true and X, Y are temporal association rules then (X Rel Y) is also a temporal association rule.

The size of a temporal association rule is the number of atomic temporal association rules present in it. A rule of size k is called a k-rule.

Problem Definition: The problem is to find large k-rules given the parameters min_support, WIN and the symbolic representation Si for each feature in the time series. The given time series is first converted into a symbolic representation as explained below.

III. SYMBOLIC REPRESENTATION

Each time series sequence is discretized to form a string, and each dimension can have a different alphabet size. The size of the alphabet may depend upon the domain and also on the amount of information we would like to keep. [17] provides an extensive survey of discretization methods. We can see that as the alphabet size is reduced, the average pattern length grows and the average number of clusters is also reduced. Though many discretization methods are possible, we have implemented an unsupervised equal-frequency discretization algorithm, which is not sensitive to outliers compared to equal-interval discretization. The choice of discretization and symbolic representation depends upon the features and

TABLE I
EXAMPLES FOR THE TERMS USED

Time series:                 X1 = 1, 2, 4.1, 3.2, 5.4, 2.2, 4.6, 1.2, 2.4, 4.3, 5.33, 5.1, 1.2, 2, 6, 3, 5
Time series sequence:        S1 = abdcebdabdeceabfce
Frequent patterns:           C = {bd, abd, ce, ab}
Position pairs for 'ab':     (0,1), (7,8), (13,14)
Position lists:              p_ab = (0,1), (7,8), (13,14);  p_abd = (0,2), (7,9);  p_ce = (3,4), (11,12), (16,17)
Clusters:                    C1 = {ab, abd},  C2 = {bd},  C3 = {ce}
Cluster position lists:      P_C1 = (0,2), (7,9), (13,14);  P_C2 = (1,2), (5,6), (8,9);  P_C3 = (3,4), (11,12), (16,17)
Temporal association rule:   C1 followedby C3  (min_support = 2, WIN = 3)
Position list of the rule:   (C1 followedby C3) = (0,4), (7,12), (13,17)

characteristics of the time series we would like to mine. For example, should we wish to mine trends instead of patterns in the time series, we can choose a method similar to [8]. Quantization causes some loss of the original information, but a later step in our process clusters similar strings together and considers them as one equivalence class. This latter step significantly neutralizes the loss of information.

Also, if we wish to mine for the dynamics of the time series, we would construct the differential of the time series with respect to time and then discretize it. Such a time series will preserve similar patterns occurring at different offset values. Below are the steps for a modified equal-frequency discretization function (it assigns the same level to equal values) which takes a time series X and discretizes X with a maximum of k levels.

1: Sort the time series X, X' = sort(X), while keeping a reverse index pointing to the original time series, i.e. I(j) = i where x'_j = x_i are the same elements.
2: Initialize bin_size := n/k, level := 0
3: for each element x'_j, j < n do
4:   if x'_j = x'_{j−1} then
5:     assign x'_j the same level as x'_{j−1}
6:   else
7:     if the current level has more than bin_size elements then
8:       increase the level and assign it to x'_j
9:     else
10:      assign x'_j the current level
11:    end if
12:  end if
13: end for
14: Using the reverse index I, map the levels of the sorted time series back to the original time series
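A minimal runnable sketch of this modified equal-frequency discretization, under the assumption that tied values inherit the level of their first occurrence, as in the steps above (function name is ours):

def equal_frequency_discretize(x, k):
    """Assign at most k levels so that each level receives roughly n/k values; ties share a level."""
    n = len(x)
    order = sorted(range(n), key=lambda i: x[i])     # reverse index I into the original series
    bin_size = n / k
    levels = [0] * n
    level, count = 0, 0
    for j, i in enumerate(order):
        if j > 0 and x[i] == x[order[j - 1]]:
            levels[i] = levels[order[j - 1]]         # equal values get the same level
        elif count >= bin_size:
            level, count = level + 1, 0              # current bin is full: open a new level
            levels[i] = level
        else:
            levels[i] = level
        count += 1
    return levels

# e.g. map levels to symbols: ''.join(chr(ord('a') + lv) for lv in equal_frequency_discretize(series, 6))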

The mining process is independent of any particular reasonable quantization as long as the quantization does not compress or stretch along the time scale.


Fig. 3. Enhanced Suffix Tree for abdcebdabdeceabfce$

IV. FINDING FREQUENT PATTERNS

Once the suffix tree has been constructed, the number of times a particular pattern occurs equals the total number of leaf nodes under it. Also, the locations of each of these occurrences can be traced back by following the pointers to the string from each of the leaf nodes. For example, in figure 3 the pattern ce repeats 3 times and it has 3 leaf nodes. One of the challenges involved in such enumeration is the huge amount of data produced when keeping track of every pattern and all the locations at which the pattern occurs. Even enumerating the patterns and all their locations is computationally expensive. To solve this we use an efficient encoding scheme for storing all the locations of all the patterns. This encoding scheme still preserves the linear-time characteristic of the algorithm while enumerating all the patterns and the corresponding locations of each pattern. It is also space efficient: storing the location information takes no more space than the total number of leaf nodes in the tree. Listed below are the steps for constructing the enhanced suffix tree to enumerate all frequent substrings in a sequence.

1: Construct the suffix tree using the online suffix tree construction algorithm such that each leaf node contains a pointer, called leafindex, which points to the starting location inside the sequence of the pattern ending at the node.
2: Order the children of each node alphabetically by the edge connecting the child node.
3: Initialize leaforder := 0
4: Starting from the root, depth-first traverse the tree and update the following information at each node.
5: if traversing from parent to child then
6:   populate the child node with the length of the string traversed along the path from the root to the child node, called patternlength
7: end if
8: if traversing from child to parent then
9:   increment the leaf count at the parent by adding the child's count of leaf nodes (child_count)
10:  if the child is the leftmost child then
11:    store a pointer to the child's leftmost leaf node, called leftptr
12:  end if
13: end if
14: if the node is a leaf node then
15:   store leaforder at the node and increment leaforder by 1
16:   L(leaforder) := (leafindex − patternlength)
17: end if
18: Depth-first traverse the tree again
19: for all nodes inside the tree do
20:   if node.child_count ≥ MIN_SUPPORT then
21:     for i = 0 to node.child_count − 1 do
22:       print L(leftptr.leaforder + i)
23:     end for
24:   end if
25: end for

As we can see in the above algorithm, for any pattern the locations at which it occurs can be given by only two parameters, leftptr.leaforder and child_count, on the Leaf Array. So, for every pattern we can enumerate all the locations by specifying only two values. Since there is only one Leaf Array, we can store and enumerate all the locations of all the patterns by storing only the Leaf Array and two parameters for each pattern. As an example, let us consider the pattern bd in Figure 3. bd has 3 leaf nodes starting at positions 7, 3, 10 (also called leafindex). The pattern length of bd is 2, and the Leaf Array values corresponding to the leaf nodes of bd are 5, 1, 8. The leaforder of the leftmost child (abdeceabfce) of bd is 3. The locations are given by L(3), L(4), L(5), which are 5, 1, 8. Now let us consider pattern b: the leaforder of the leftmost leaf node of b is 3 and its child_count is 4, so b occurs at L(3), L(4), L(5), L(6), which happens to be 5, 1, 8, 14.
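A small sketch of the location lookup that this encoding enables; the Leaf Array entries below are the ones given in the bd/b example, with unknown entries left as placeholders:

def pattern_locations(leaf_array, leftmost_leaforder, child_count):
    """All starting locations of a pattern, read as a contiguous slice of the Leaf Array."""
    return leaf_array[leftmost_leaforder: leftmost_leaforder + child_count]

# Leaf Array consistent with the example: L(3..6) = 5, 1, 8, 14 (other entries unknown here)
L = [None, None, None, 5, 1, 8, 14]
print(pattern_locations(L, 3, 3))   # locations of 'bd' -> [5, 1, 8]
print(pattern_locations(L, 3, 4))   # locations of 'b'  -> [5, 1, 8, 14]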

Pruning of patterns is also possible, since the frequency of a pattern as well as its length are maintained at each node. Pruning can be done by outputting only those nodes whose frequency and length are greater than some threshold while doing the depth-first traversal of the suffix tree. More complex pruning, such as removing substrings, can also be performed by taking node and edge information into account during the depth-first traversal. Finally, the order of the algorithm for finding frequent patterns and all their locations is linear in the length of the time series, since [21] is linear and the depth-first traversal of the suffix tree is also linear.

V. CLUSTERING

Suffix trees allow for finding exact matches of a string, but the data may contain noise and the same pattern can occur with noise at different locations. This noise can be substitutive, insertion or deletion noise. Substitutive noise refers to a pattern occurring with one or more of its symbols mismatched; for example, 'aabbcc' can occur as 'babbcc' or 'aacbcc'. Insertion noise refers to a pattern occurring with one or more symbols inserted into the pattern, for example 'aabbcc' occurring as 'aabbacc' or 'aabbccc'. Deletion noise refers to one or more symbols missing from the pattern, e.g. 'aabbcc' occurring as 'aabbc' or 'abbcc'. There are many symbolic similarity measures [9] which capture this kind of noise, such as edit distance and longest common subsequence (LCS). The similarity measure which we have chosen to cluster the strings into equivalence classes is based upon the longest common subsequence. The advantage of such a measure is that it allows for gaps between the symbols of a pattern while matching them for similarity. Formally, if s1, s2 are two strings, then the similarity measure is:

Sim(s1, s2) = 2 · LCS(s1, s2) / ( |s1| + |s2| )

and Dist(s1, s2) = 1 − Sim(s1, s2) is the distance measure. LCS(s1, s2) is the length of the longest common subsequence (a common subsequence of s1, s2 of maximal length), and |s| is the length of string s. The advantage of choosing such a non-metric similarity measure over a metric such as edit distance or Levenshtein distance is that it scales with the length of the strings being compared. For example, if we keep an LCS threshold small, strings much larger than the threshold allow room for a lot of noise while shorter strings match correctly; conversely, if we keep the threshold large, larger strings match correctly but shorter ones fail because the threshold is too high. So we take the ratio between the LCS and the average length of the two strings as the similarity measure, and this measure works for strings of all length ranges. We can see that when the two strings are exactly the same the similarity measure becomes 1 and the distance becomes 0. The LCS between two strings can be calculated in O(mn/w) time [30], where m, n are the lengths of the strings and w is the machine word width. This can be expensive for large strings, so one optimization we have implemented is a histogram comparison for large strings: the distance is calculated only if the histograms match within a certain threshold. If the two strings are similar, the edit distance can be calculated more efficiently using the O(ND) difference algorithm [29], where D is the minimum edit distance and N is m + n.
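A straightforward quadratic sketch of this similarity and distance (not the bit-vector algorithm of [30]):

def lcs_length(s1, s2):
    """Length of the longest common subsequence (simple O(|s1||s2|) dynamic program)."""
    prev = [0] * (len(s2) + 1)
    for c1 in s1:
        cur = [0]
        for j, c2 in enumerate(s2, 1):
            cur.append(prev[j - 1] + 1 if c1 == c2 else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def sim(s1, s2):
    return 2 * lcs_length(s1, s2) / (len(s1) + len(s2))

def dist(s1, s2):
    return 1 - sim(s1, s2)

# a noisy occurrence still scores high: sim('aabbcc', 'aabbacc') ~ 0.92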

Using the above LCS-based similarity measure we cluster the patterns. Since for large databases the number of patterns for each dimension can run up to hundreds of thousands, hierarchical clustering is too expensive, so we use sequential clustering, which is less expensive. We have also reduced the number of candidate patterns for clustering by dividing the patterns into overlapping buckets based upon pattern length and clustering the patterns inside each bucket only. Since patterns in different buckets have large variation in length, their similarity would be low and they would not end up in the same cluster anyway. Having overlapping buckets also helps place a pattern in the closest cluster of adjacent buckets. The patterns are sorted by length and then alphabetically. A pattern is assigned to a particular bucket if its length falls into the range of the bucket. Clustering is then done among all the patterns inside a bucket. Finally, after clustering is done, if a pattern is part of multiple clusters then it is assigned to the cluster with the maximum similarity measure. The outline of the algorithm is as follows:

1: Sort the patterns pj ∈ S' by length(pj) and then alphabetically.
2: Let Bi be a bucket and Bi.start, Bi.end be the start and end of the range of Bi. Let B = {B0, B1, B2, ..., Bmax} be the set of all buckets.
3: for each bucket Bi ∈ B do
4:   Bi = Bi ∪ {pj}, iff Bi.start ≤ length(pj) ≤ Bi.end
5: end for
6: for each bucket Bi do
7:   Cluster all patterns pj ∈ Bi in the bucket Bi using the two-threshold sequential clustering algorithm, producing counti clusters. // any clustering algorithm can be used in this step
8: end for
9: for each pattern pj do
10:   min_distance := 1
11:   for each cluster Ci such that pj ∈ Ci do
12:     if Dist(pj, Ci) ≤ min_distance then
13:       pj.clusterno := Ci; min_distance := Dist(pj, Ci)
14:     end if
15:   end for
16: end for

The details of the two-threshold sequential algorithmic scheme (TTSAS) can be found in [31]. Although any clustering algorithm can be used in the clustering step (step 7), the choice of TTSAS is justified by its lower complexity and weaker dependence upon the order in which the values are input. In TTSAS, Dist(pj, Ci) is computed as Dist(pj, pr), where pr is the cluster representative. pr is chosen such that it has the least average distance from all the patterns in that cluster. Whenever a pattern is added to the cluster, a new cluster representative is chosen.
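A simplified sketch of the bucketed clustering pass, with a plain single-threshold sequential scheme standing in for TTSAS [31]. The distance function is passed in (e.g. Dist from the LCS sketch above), and the bucket width, overlap and threshold are illustrative choices, not values from the paper.

def sequential_cluster(patterns, dist_fn, threshold=0.35):
    """Assign each pattern to the first cluster whose representative is within `threshold`,
    otherwise open a new cluster. The representative is the first member (a simplification)."""
    clusters = []
    for p in patterns:
        for c in clusters:
            if dist_fn(p, c[0]) <= threshold:
                c.append(p)
                break
        else:
            clusters.append([p])
    return clusters

def bucketed_clustering(patterns, dist_fn, bucket_width=4, overlap=2):
    """Sort patterns by length, place them into overlapping length buckets, cluster per bucket."""
    patterns = sorted(patterns, key=lambda p: (len(p), p))
    max_len = max(len(p) for p in patterns)
    buckets, start = [], 1
    while start <= max_len:
        end = start + bucket_width
        buckets.append([p for p in patterns if start <= len(p) <= end])
        start = end - overlap + 1
    return [sequential_cluster(b, dist_fn) for b in buckets if b]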

VI. TEMPORAL ASSOCIATION RULES

The clustering of patterns brings together all the similar patterns, with respect to a threshold, into the same equivalence class. Once we have the clusters, we need the position list for each cluster so that we can find the temporal relationships between the clusters. Then we incrementally mine for higher-order temporal relationships between clusters; these are what we call temporal association rules. The position list for each of the clusters is constructed as follows:

for each cluster Ci do
  for each pattern pj ∈ Ci do
    create a position list Lpj from the Leaf Array
    LCi = Lpj ∪ LCi
  end for
end for

The relationship between two cluster position lists can be any one of followed by, overlaps and contains, as shown in figure 2. Moreover, the three relationships mentioned above conceptually capture most of the possible temporal relationships and the commonly mined relationships. The first step in mining for temporal association rules is to start with atomic rules and incrementally mine for higher-order rules in a level-wise algorithm. This means discovery of level-2 rules, followed by level 3, followed by level 4, until the maximum level specified is reached.

Mining for 2-rules is done by an all-to-all comparison of all the cluster position lists. Let C be the set of all the cluster position lists, min_support be the minimum value of support and min_confidence be the minimum value of confidence, and let C2, the set of 2-rules, be initialized to ∅.

1: level = 1
2: Clevel = C
3: while level + 1 < max_level do
4:   for each cluster ci in Clevel do
5:     for each cluster cj in C do
6:       for each relation rel ∈ {followed by, overlaps, contains} do
7:         if support(ci rel cj) > min_support then
8:           Clevel+1 = Clevel+1 ∪ (ci rel cj)  // add the newly generated rule to the candidates for the next level
9:           compute the position list for (ci rel cj)
10:        end if
11:      end for
12:    end for
13:  end for
14:  level = level + 1
15: end while

The support of (ci Rel cj) is defined as the number of position pairs ai ∈ ci for which Rel(ai, cj) holds, i.e. δ(ai, cj, Rel) = 1. The confidence of (ci Rel cj) is defined as |⋃ ai| / |ci|, where ai is a position pair with (ai ∈ ci) ∧ (Rel(ai, cj) is true), i.e. the number of position pairs in ci for which the relation holds divided by the total number of position pairs in ci. These rules are similar to the A-1 type mentioned in [1]. After finding all the frequent k-rules, the rules are pruned to retain high-confidence rules. We can also modify the algorithm to output any level rule if its confidence is greater than some threshold. While confidence is one way to find relevant and important rules, summarization is another way to accomplish this task.
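A hedged sketch of one level of this all-to-all mining pass; the relation predicates are passed in (for instance the pairwise relations sketched after Definition 7), and support and confidence follow the definitions above. Names are ours.

def mine_level2(cluster_lists, relations, min_support):
    """cluster_lists: {cluster id -> position list}; relations: {name -> rel(a, b) predicate}."""
    rules = {}
    for i, ci in cluster_lists.items():
        for j, cj in cluster_lists.items():
            if i == j:
                continue
            for name, rel in relations.items():
                support = sum(any(rel(a, b) for b in cj) for a in ci)
                if support > min_support:
                    rules[(i, name, j)] = (support, support / len(ci))   # (support, confidence)
    return rules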

VII. SUMMARIZATION

Sometimes higher-level or more general knowledge might be inferred from lower-level temporal dependencies. Summarization is the process which generalizes the temporal association rules by identifying the time windows in which a rule is applicable. The measures used for summarization are coverage, average coverage length and maximum coverage length. Coverage indicates the importance and applicability of the rule on the given data: the larger the coverage, the larger the rule's relevance and applicability. Average coverage length gives domain-related information, such as whether the patterns occur together contiguously as a stretch (for example seasonal patterns, modes of operation, events) or whether they occur ad hoc, at random or as discrete occurrences. Maximum coverage gives us the longest stretch of occurrence of the temporal rule, indicating an important event or condition. The exact meanings of the average coverage and maximum coverage are very domain dependent. Summarization creates a new position list from the existing position list of the rule being summarized by combining all the position pairs that are closer than the distance SUMMARY_WIN. The output is a new position list much shorter than the original position list. Below is the definition of summarization.

Definition 9: Let Pij be the position list corresponding to ci rel cj. Then Summ(Pij, SUMMARY_WIN) is a summarized position list corresponding to ci rel cj and window length SUMMARY_WIN such that any two position pairs a, b in Pij where a < b and bs − ae ≤ SUMMARY_WIN are merged to form a single position pair c = (as, be), until there are no further position pairs a, b with bs − ae ≤ SUMMARY_WIN.

Summarization finds the regions where a rule is applicable, and measuring the extent of these regions indicates how important that particular rule is. Coverage is one such measure, which calculates the extent of the regions where the rule is applicable. Coverage is defined as follows.

Definition 10: Coverage measures the percentage of the time points in the time series for which the summarized temporal rule applies. Let SPij be the summarized position list corresponding to ci rel cj and window SUMMARY_WIN. Then

coverage(ci rel cj) = ( Σ_{p ∈ SPij} |p| ) / |pmax|,

where pmax is the position pair (0, n), |pmax| = n, and |p| for a position pair p is pe − ps + 1. The average coverage length measures the average length of the position pairs in the summarized position list and is given by ( Σ_{p ∈ SPij} |p| ) / |SPij|. The maximum coverage is defined as the length of the longest position pair in the summarized position list SPij and is given by max(|p|), p ∈ SPij.
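A small sketch of summarization and the three measures of Definitions 9 and 10; the SUMMARY_WIN value and the sample list below are illustrative:

def summarize(position_list, summary_win):
    """Merge consecutive pairs whose gap is at most summary_win (Definition 9)."""
    out = []
    for ps, pe in position_list:                   # assumed sorted and non-overlapping
        if out and ps - out[-1][1] <= summary_win:
            out[-1] = (out[-1][0], pe)
        else:
            out.append((ps, pe))
    return out

def coverage_measures(summary, n):
    """Return (coverage, average coverage length, maximum coverage) for a summarized list."""
    lengths = [pe - ps + 1 for ps, pe in summary]
    return sum(lengths) / n, sum(lengths) / len(summary), max(lengths)

sp = summarize([(0, 4), (7, 12), (13, 17)], summary_win=2)   # -> [(0, 4), (7, 17)]
print(coverage_measures(sp, n=18))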

For an example of summarization, we take the energy dataset of an industrial plant with two variables, the pressure and the current of a compressor. Each temporal rule maps to a mode of operation of the compressor, and the coverage tells us how long the compressor runs in a particular mode and when the modes of operation change; this information is valuable for optimizing the current consumption.

VIII. EXPERIMENTAL RESULTS

Utility Dataset:

The experiments were done with a large real data set which recorded the production parameters of a large compressor. The attributes measured were the current, the pressure inside the compressor and the discharge pressure, at intervals of 10 seconds for 1 month; this constitutes 259200 observations. Each dimension was


(a) rule = (46 contains 87), support = 900, confidence = 0.9
(b) rule = ((9 followedby 183) contains 113), support = 732, confidence = 0.91
(c) Symbolic representation of rule (46 contains 87)
(d) Summarization of rule (46 contains 87)
(e) Rule between power consumed and barrels produced in the brewery dataset
(f) Scalability with the EEG dataset

Fig. 4. Temporal Association Rules


discretized with different levels of discretization using equal-value discretization. In a sample run with a minimum frequent pattern length of 6 characters, there were a total of 12567 frequent patterns with a maximum pattern length of 1678 characters. After clustering there were 252 clusters, which produced 480 temporal association rules with a minimum support of 300 (300 occurrences of a pattern). The whole process took over 6 minutes to mine rules of maximum length 4. A sample rule denoting an operating mode between current and pressure, (46 contains 87), is shown in figure 4(a), and the symbolic representation of this rule is shown in 4(c). The higher-level summarization of the rule is shown in 4(d). This rule actually mapped to the operating-mode pattern of the compressor during midday time. A domain expert can use the values of average coverage and maximum coverage to identify the long operating modes, reduce the breaks in operating modes and thus cut power consumption costs.

EEG Dataset:

A sample of the EEG data set from [14] has been processed with our algorithms. The dataset contains 1 million points, each containing a single electrode voltage value. In a sample run there were 4935 frequent patterns and 120 clusters from the frequent patterns. The association rules searched were of the 'followed by' type since there was only one dimension. The summarization identified long periods of repeating similar patterns. This experiment also enabled us to test the algorithm with a large dataset of 1 million rows and show the scalability of our algorithms. The frequent pattern enumeration with a minimum support of 100 and a minimum character length of 4 was done in 7 seconds on a Pentium 4, 2.8 GHz machine running Windows. Mining temporal rules of length 4 with support 1% took 3 minutes. This performance is comparable with the above-mentioned utility dataset containing around 250 thousand rows in 3 dimensions. Figure 4(f) shows the scalability of the mining process with the size of the dataset. The algorithm has been run with a minimum pattern length of 4 characters and a maximum rule length of 4 (4-rules). This shows that the algorithm can be run in reasonable time for time series lengths up to 1 million, after which physical memory limitations seem to affect the performance.

Brewery Dataset:

The brewery data set consisted of monitoring of production parameters for each day over a period of 1 year. The attributes were the power consumed, the wet bulb thermometer reading and the number of barrels brewed. This was a small data set with only 365 points and 3 dimensions; the results showed seasonal variation in power consumption and strong temporal dependencies of the 'followed by' type between two of the attributes. An example association rule between energy consumption and the number of barrels produced is presented in figure 4(e).

IX. CONCLUSION

We have presented a methodology for mining temporal associations among frequent patterns occurring in multivariate time series data. This methodology seeks to control the exponential explosion of strings by clustering similar strings into equivalence classes. We also seek to discover temporal associations among these classes from a richer set of possibilities. We have shown that this algorithm yields meaningful relationships in real-life data sets. The algorithm is also scalable and can be applied to large datasets. The summarization aspect of our methodology identifies time periods during which a particular temporal dependency holds; this is equivalent to identifying modes of operation of a system. We have shown that we can discover patterns at various levels of abstraction, starting from very fine-grained data, in a scalable manner. These capabilities give a user more power than any other framework presented in the literature.

REFERENCES

[1] Po-Shan Kam and Ada Wai-Chee Fu. Discovering temporal patterns for interval-based events. In Yahiko Kambayashi, Mukesh K. Mohania, and A. Min Tjoa, editors, Second International Conference on Data Warehousing and Knowledge Discovery (DaWaK 2000), volume 1874, pages 317–326, London, UK, 2000. Springer.
[2] Gautam Das, King-Ip Lin, Heikki Mannila, Gopal Renganathan, Padhraic Smyth. Rule Discovery from Time Series. Proc. Fourth International Conference on Knowledge Discovery and Data Mining (KDD-98), New York, New York, August 27-31, 1998, pp. 16-22.
[3] Agrawal, Rakesh, Srikant, Ramakrishnan. Mining Sequential Patterns. ICDE '95: Proceedings of the Eleventh International Conference on Data Engineering, 1995.
[4] Roddick, John F., Winarko, Edi. Discovering Richer Temporal Association Rules from Interval-based Data: Extended Report. School of Informatics and Engineering, Flinders University, 2005.
[5] H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. Data Mining and Knowledge Discovery, 1(3):259–289, 1997.
[6] Hongjun Lu, Ling Feng, Jiawei Han. Beyond intratransaction association analysis: mining multidimensional intertransaction association rules. ACM Trans. Inf. Syst. 18(4): 423-454, 2000.
[7] Joseph L. Hellerstein, Sheng Ma. Mining Event Data for Actionable Patterns. Int. CMG Conference 2000: 307-318.
[8] Udechukwu, A., Barker, K., Alhajj, R. (2004). Discovering All Frequent Trends in Time Series. In proceedings of the 2004 Winter International Symposium on Information and Communication Technologies (WISICT 2004), Jan 5-8, Cancun, Mexico.
[9] Dimitrios Gunopulos, Gautam Das. Time Series Similarity Measures and Time Series Indexing. SIGMOD Conference 2001.
[10] J. F. Allen. Maintaining knowledge about temporal intervals. Communications of the ACM, 26(11):832–843, 1983.
[11] Villafane, R., Hua, K. A., Tran, D., Maulik, B. Mining interval time series. In: Data Warehousing and Knowledge Discovery (1999), 318-330.
[12] J. Lin, E. Keogh, P. Patel, and S. Lonardi. Finding Motifs in Time Series. In proceedings of the 2nd Workshop on Temporal Data Mining, at the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton, Alberta, Canada, July 23-26, 2002.
[13] Hoppner, F. (2001). Discovery of temporal patterns – learning rules about the qualitative behavior of time series. In Proceedings of the 5th European Conference on Principles and Practice of Knowledge Discovery in Databases. Freiburg, Germany, pp 192-203.
[14] E. Keogh. The UCR time series data mining archive, http://www.cs.ucr.edu/~eamonn/TSDMA/index.html, University of California Computer Science and Engineering Department, Riverside, CA, 2003.
[15] E. Keogh, S. Lonardi, and B. Y. Chiu. Finding surprising patterns in a time series database in linear time and space. In the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, July 23-26, 2002.
[16] C. Mooney, J. F. Roddick. Mining Relationships between Interacting Episodes. In Proc. 2004 SIAM International Conference on Data Mining, Orlando, Florida. Dayal, U. and Berry, M. W., Eds.
[17] Daw, C. S., Finney, C. E. A., Tracy, E. R. (2003). A review of symbolic analysis of experimental data. Review of Scientific Instruments 74: 916-930.
[18] K. Wang, J. Tan. Incremental Discovery of Sequential Patterns. Proc. SIGMOD Data Mining Workshop on Research Issues on Data Mining and Knowledge Discovery, 1996.
[19] Indyk, P., Koudas, N., and Muthukrishnan, S. Identifying representative trends in massive time series data sets using sketches. In Proceedings of the 26th Int'l Conference on Very Large Data Bases, Cairo, Egypt, Sept 10-14, 2000, pp. 363-372.
[20] M. Nelson. Fast String Searching With Suffix Trees. Dr. Dobb's Journal, August 1996.
[21] E. Ukkonen. On-line Construction of Suffix Trees. Algorithmica, 14(3), pp. 249-260, 1995.
[22] Morchen, F., Ultsch, A. Discovering Temporal Knowledge in Multivariate Time Series. Proc. GfKl, Dortmund, Germany, 2004.
[23] Morchen, F., Ultsch, A. (2004). Mining Hierarchical Temporal Patterns in Multivariate Time Series. In proceedings of the 27th German Conference on Artificial Intelligence (KI), Sept 20-24, Ulm, Germany.
[24] André-Jönsson, H. & Badal, D. (1997). Using signature files for querying time-series data. In proceedings of Principles of Data Mining and Knowledge Discovery, 1st European Symposium, Trondheim, Norway, Jun 24-27, pp 211-220.
[25] Chan, K. & Fu, A. W. (1999). Efficient time series matching by wavelets. In proceedings of the 15th IEEE Int'l Conference on Data Engineering, Sydney, Australia, Mar 23-26, pp 126-133.
[26] Keogh, E., Chakrabarti, K., Pazzani, M. & Mehrotra, S. (2000). Dimensionality reduction for fast similarity search in large time series databases. Journal of Knowledge and Information Systems, pp 263-286.
[27] Perng, C., Wang, H., Zhang, S., & Parker, S. (2000). Landmarks: a new model for similarity-based pattern querying in time series databases. In proceedings of the 16th International Conference on Data Engineering.
[28] Roddick, J. F., Hornsby, K. & Spiliopoulou, M. (2001). An Updated Bibliography of Temporal, Spatial and Spatio-Temporal Data Mining Research. In Post-Workshop Proceedings of the International Workshop on Temporal, Spatial and Spatio-Temporal Data Mining. Berlin, Springer, Lecture Notes in Artificial Intelligence 2007. Roddick, J. F. and Hornsby, K., Eds. 147-163.
[29] Myers, G. 1986. An O(ND) Difference Algorithm and Its Variations. Algorithmica 1, 2, 251-266.
[30] Crochemore, M., Iliopoulos, C. S., Pinzon, Y. J., and Reid, J. F. 2001. A fast and practical bit-vector algorithm for the longest common subsequence problem. Inf. Process. Lett. 80, 6 (Dec. 2001), 279-285.
[31] S. Theodoridis, K. Koutroumbas. Pattern Recognition, Academic Press, 1998, pp 438-440.


An Empirical Study on Multistep-ahead Time Series Prediction

Haibin Cheng, Pang-Ning Tan, Jing Gao, Jerry Scripps

Department of Computer Science and Engineering

Michigan State University

chenghai,ptan,[email protected], [email protected]

Abstract

Multistep-ahead prediction is the task of predicting a

sequence of continuous values in a time series. A

typical approach is to apply a regression model step-

by-step and use the predicted value of the current time

step to determine its value in the next time step. This

approach is known as multi-stage prediction. In this

paper, we investigate two alternative approaches

known as independent value prediction and parameter

prediction. The first approach builds a separate model

for each prediction step using only the values observed

in the past. The second approach fits a parametric

function to the sequence of time series values and

builds a separate model to predict each parameter of

the fitted function. The estimated parameters are then

used to reconstruct the shape of the predicted time

series. We perform a comparative study among the

three prediction approaches using multiple linear

regression (MLR), recurrent neural networks (RNN),

and hybrid Hidden Markov Model/Multiple Linear

Regression (HMM/MLR) as the underlying regression

methods. The advantages and disadvantages of each

prediction approach are analyzed in terms of their

error accumulation, smoothness of prediction, learning

difficulty, and efficiency.

1. Introduction

In many time series forecasting problems, we are

interested in predicting a sequence of future values

using only the values observed in the past. This task is

also known as multistep-ahead time series prediction

[1]. Examples of such task include predicting the time

series for crop yield, climate indices, stock prices,

traffic volume, and electrical power consumption. By

knowing the sequence of future values, we may derive

interesting properties of the time series such as its

projected amplitude, variability, onset period, and

frequency of abnormally high or low values. For

example, multistep-ahead time series prediction allows

us to forecast the growing period of corn for next year,

the maximum and minimum temperature for next

month, the frequency of El-Nino events in the next

decade, and so on.

A common approach to solve this problem is to

construct a single regression model from historical

values of the time series and then apply the model step

by step to estimate its future values. This approach is

known as multi-stage prediction [17]. During the

course of making its predictions, multi-stage prediction

uses the predicted value of the current time step to

determine its value in the next time step. As a result,

we expect such an approach to be susceptible to the

error accumulation problem, i.e., errors committed in

the past being propagated into future predictions. In

this paper, the error accumulation problem is studied

using the bias-variance decomposition framework

[23][24].

As part of our empirical study, we also investigate

two alternative approaches for multistep-ahead time

series prediction. The first approach, which is known as

independent value prediction, builds a separate model

for each prediction step . Because the model at each

time step is built independently, this approach is less

susceptible to the error accumulation problem.

However it has other limitations in terms of efficiency

(multiple models have to be constructed), learning

difficulty (learning task becomes more challenging as

prediction step increases), and lack of smoothness in

the predicted time series. Another approach is called

parameter prediction, in which a parametric function is

used to fit each sequence of output values in the

training data. A separate regression model is then built

to learn each parameter of the fitted function. We show

that this method avoids the lack of smoothness and

error accumulation problems. It can also be more

efficient than independent value prediction when the

number of parameters in the fitting function is small.

However, its main difficulty has to do with choosing

the right parametric function to fit the sequence of

output values.

We perform a comparative study among the three

prediction approaches using multiple linear regression

(MLR) [2, 3], recurrent neural networks (RNN) [4],


and a hybrid hidden Markov model with multiple linear

regression (HMM/MLR) [16] as the underlying

regression methods. The advantages and disadvantages

of each prediction approach are analyzed in terms of

their error accumulation, smoothness of prediction,

learning difficulty, and efficiency.

The rest of the paper is organized as follows.

Section 2 formalizes the multistep-ahead time series

prediction problem. Section 3 presents the regression

methods used in our experiments. We then describe the

multi-stage prediction, independent value prediction,

and parameter prediction approaches. Our

methodology for model selection is also described in

this section. We then present our experimental results

in Section 4. We conclude our discussion and raise

several issues for future work in Section 5.

2. Preliminaries

A time series is a sequence of observations in which each observation x_t is recorded at a particular timestamp t. A time series of length t can be represented as an ordered sequence X = [x_1, x_2, ..., x_t]. For brevity, we use the notation X_{t−p}^{t} to represent the sequence [x_{t−p}, x_{t−p+1}, ..., x_t].

Definition 1 [Single-step prediction] Single-step prediction is the task of predicting x_{t+1} given X_1^{t}.

If the future value depends only on its p previous observations, the single-step prediction problem can be expressed mathematically as follows:

x_{t+1} = f( X_{t−p+1}^{t} )        (1)

Definition 2 [Multistep-ahead prediction] Multistep-ahead prediction is the task of predicting a sequence of h values, X_{t+1}^{t+h}, given its p past observations, X_{t−p+1}^{t}:

X_{t+1}^{t+h} = f( X_{t−p+1}^{t} )        (2)

3. Methodology

This section presents the three prediction approaches

investigated in this study. We first describe the

regression methods used in our experiments.

3.1. Regression Methods

We consider three regression methods in this study:

multiple linear regression (MLR), recurrent neural

networks (RNNs), and a hybrid of hidden Markov

Model with multiple linear regression (HMM/MLR).

To simplify the discussion, we describe these methods

in the context of single-step prediction problems.

3.1.1. Multiple Linear Regression (MLR). The general form of the MLR model, which is also called the AR model, is given by the following equation:

x_{t+1} = Σ_{i=1}^{p} a_i x_{t+1−i} + ε_t,        (3)

where the current value of the time series is expressed as a linear combination of its p past values plus a noise term ε_t. The noise term is a random variable with zero mean and variance σ_ε². The coefficient vector [a_1, ..., a_p]^T contains the parameters to be estimated from the training data by minimizing the sum of squared errors, SSE = Σ_{i=t+1}^{t+h} ( x_i − f(X_{i−p}^{i−1}) )². The variance σ_ε² is estimated using SSE / h, where h is the size of the prediction window.
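A minimal sketch of fitting such an AR(p) model by ordinary least squares with numpy; this is a generic AR fit, not the authors' training code:

import numpy as np

def fit_ar(x, p):
    """Estimate AR(p) coefficients [a_1, ..., a_p] by minimizing the sum of squared errors."""
    x = np.asarray(x, dtype=float)
    X = np.column_stack([x[p - i: len(x) - i] for i in range(1, p + 1)])  # lag-i regressors
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def predict_next(history, coef):
    """One-step prediction x_{t+1} = sum_i a_i * x_{t+1-i}."""
    p = len(coef)
    lags = np.asarray(history, dtype=float)[-1:-p - 1:-1]   # x_t, x_{t-1}, ..., x_{t-p+1}
    return float(np.dot(coef, lags))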

3.1.2. Recurrent Neural Networks (RNN) have

been successfully applied to noisy and non-stationary

time series prediction [10, 15]. In RNN, the temporal

relationship of the time series is explicitly modeled

using feedback connections [4] to the internal nodes

(known as hidden units), as shown in Figure 1.

Figure 1 A simple example of Elman Network

An RNN model can be expressed in the following way:

x_{t+1} = f( X_{t−p+1}^{t}, ε_t ),        (4)


where f is a non-linear function of the weights in the

network. The regression function f is learnt by

presenting the past values of the time series to the input

layer of the Elman back propagation network [6]. The

weights of the network are then adjusted based on the

error between the true output and the value predicted

by the network until the algorithm converges. A single

presentation of all the training data is called an epoch.

Before the network is trained, we need to specify the

number of hidden units as well as the stopping criteria.

3.1.3. Hybrid HMM/MLR Model is an extension

of traditional hidden Markov model applied to

regression analysis [16]. This method is an effective

way for modeling piecewise stationary time series,

where the observed values are assumed to be generated

by a finite number of hidden states. Let (Z_t) denote the Markov chain on the state space S = {s_1, s_2, ..., s_N}. The initial probability for a given state s is denoted as π_s, while the transition from one state to another is characterized by the transition matrix A = (a_ij), where P(Z_{t+1} = s_j | Z_t = s_i) = a_ij. At time t, the observed value x_t depends only on the current state Z_t:

x_t = f_{z_t}( X_{t−p}^{t−1} ) + e(0, σ_{z_t}),        (5)

where f_{z_t} ∈ {f_{s_1}, f_{s_2}, ..., f_{s_N}} is the corresponding regression function, σ_{z_t} ∈ {σ_{s_1}, σ_{s_2}, ..., σ_{s_N}} is the standard deviation for the regime defined by the state Z_t, and e(0, σ_s) is a noise term with mean zero and a variance that depends on the current state s. We use MLR as the regression function in our experiments.

The hybrid HMM/MLR model is trained by

maximizing the following likelihood function:

L(X_1^t) = P(X_1^t; θ) = Σ_Z Π_i P(z_i | z_{i−1}) Φ( x_i − f_{z_i}( X_{i−p}^{i−1} ) )        (6)

A brute-force method for maximizing the likelihood function requires O(N^T) operations [18]. However, an efficient approach called the forward-backward procedure can reduce the complexity of the computation down to O(N²T). This procedure is based on the well-known expectation-maximization (EM) algorithm [20]. The parameter set Θ of the model consists of the transition probabilities (a_ij), the parameters of the regression functions λ, and the standard deviations of the states in S. Let Θ* denote the current estimate of the model parameters.

Figure 2. Model HMM/MLR

Forward variable: Let α_t(s_i) = P(X_1^t, Z_t = s_i | Θ*) be the joint probability of the state s_i and the observation sequence X_1^t. This variable is iteratively computed using the following recurrence formula:

α_{t+1}(s_i) = [ Σ_{j=1}^{N} α_t(s_j) a_{ji} ] Φ( x_{t+1} − f_{s_i}( X_{t−p+1}^{t} ) )        (7)

Backward variable: Let β_i(s_j) = P(X_{i+1}^{t} | Z_i = s_j, Θ*) denote a backward variable, which is computed using the following recurrence formula:

β_i(s_j) = Σ_{k=1}^{N} Φ( x_{i+1} − f_{s_k}( X_{i−p+1}^{i} ) ) β_{i+1}(s_k) a_{jk}        (8)

Let:

w_t(s_i) = α_t(s_i) β_t(s_i) / Σ_{i=1}^{N} α_t(s_i) β_t(s_i)

and

w_t(s_i, s_j) = α_t(s_i) a_{ij} Φ( x_{t+1} − f_{s_j}( X_{t−p+1}^{t} ) ) β_{t+1}(s_j) / Σ_{i,j=1}^{N} α_t(s_i) a_{ij} Φ( x_{t+1} − f_{s_j}( X_{t−p+1}^{t} ) ) β_{t+1}(s_j)

The parameters of the model are estimated iteratively using the forward-backward procedure as follows:

a_ij = Σ_{k=1}^{t} w_k(s_i, s_j) / Σ_{k=1}^{t} w_k(s_i)        (9)

λ_{s_i} = argmin_λ Σ_{k=1}^{t} w_k(s_i) [ x_k − f_{s_i}( X_{k−p}^{k−1} ) ]²        (10)

σ_{s_i}² = Σ_{k=1}^{t} w_k(s_i) [ x_k − f_{s_i}( X_{k−p}^{k−1} ) ]² / Σ_{k=1}^{t} w_k(s_i)        (11)


Finally, let Q_t(i) denote the estimated probability that Z_t = s_i, i.e.,

Q_t(i) = α_t(s_i) / Σ_{j=1}^{N} α_t(s_j)        (12)

The probability for the next state can be estimated using the transition matrix:

Q_{t+1} = Q_t A,

while the predicted value for the next time step is:

x_{t+1} = Σ_{i=1}^{N} Q_{t+1}(i) f_{s_i}( X_{t−p+1}^{t} )        (13)

3.2. Prediction Approaches

We investigate three approaches for predicting the sequence of future values X_{t+1}^{t+h} from a given time series X_1^{t}. For all the regression methods such as MLR, RNN, and HMM/MLR, multiple subsequences of the time series are obtained using a sliding window of size p+h.

Each instance of the sliding window corresponds to a record in the training set D, as shown in Table 1. X' contains the first p values of the window while Y contains the remaining h values of the window. For example, the first record contains X' = [x_1, x_2, ..., x_p] as its input variables (regressors) and Y = [x_{p+1}, x_{p+2}, ..., x_{p+h}] as its output variables (responses). Similarly, the second record contains X' = [x_2, x_3, ..., x_{p+1}] as its input variables and Y = [x_{p+2}, x_{p+3}, ..., x_{p+h+1}] as its output variables, and the last record contains X' = [x_{t−h−p+1}, x_{t−h−p+2}, ..., x_{t−h}] as its input variables and Y = [x_{t−h+1}, x_{t−h+2}, ..., x_t] as its output variables. For notational convenience, we use Y(i) to refer to all the values in the ith column of Y in D. For example, Y(3) = [x_{p+3}, x_{p+4}, ..., x_{t−h+3}]^T.

Table 1. Training set D = X' × Y

X' = [X'(1), X'(2), ..., X'(p)]             Y = [Y(1), Y(2), ..., Y(h)]
[x_1, x_2, ..., x_p]                        [x_{p+1}, x_{p+2}, ..., x_{p+h}]
[x_2, x_3, ..., x_{p+1}]                    [x_{p+2}, x_{p+3}, ..., x_{p+h+1}]
...                                         ...
[x_{t−h−p+1}, ..., x_{t−h}]                 [x_{t−h+1}, ..., x_t]
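A small generic sketch of building this training set with a sliding window (not the authors' code):

def make_training_set(x, p, h):
    """Slide a window of size p+h over x; X' rows are the first p values, Y rows the next h."""
    X_rows, Y_rows = [], []
    for start in range(len(x) - p - h + 1):
        X_rows.append(x[start:start + p])
        Y_rows.append(x[start + p:start + p + h])
    return X_rows, Y_rows

# e.g. with x = [x_1, ..., x_10], p = 3, h = 2: first record X' = [x_1, x_2, x_3], Y = [x_4, x_5]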

3.2.1. Multi-stage Prediction aims to predict the future values in a step-by-step manner. For example, we first predict x_{t+1} using the previous p values, x_{t−p+1}, ..., x_{t−1}, x_t. We then predict x_{t+2} based on its previous p values, which include the predicted value for x_{t+1}. The procedure is repeated until the last future value, x_{t+h}, has been estimated. This approach is a direct generalization of the single-step prediction method.

Given the data set shown in Table 1, we only need

to build a single regression model f for predicting the

first column, Y(1), using the input vector X’ as the

Figure 3. A sliding window is used to create the regression training set D=X’+Y. X’ represents the input values from the first part of the sliding window w1 and Y is a set of target values from the second part of the sliding window w2.


explanatory attributes. Once f is derived, the model is

applied step-by-step as shown below:

x_{t+1} = f( x_{t−p+1}, ..., x_{t−1}, x_t ),
x_{t+2} = f( x_{t−p+2}, ..., x_t, x_{t+1} ),
.......
x_{t+h} = f( x_{t+h−p}, ..., x_{t+h−2}, x_{t+h−1} )        (14)

Multi-stage prediction models are generally simple

and efficient to build. However, as will be shown later,

they suffer from the error accumulation problem,

especially when the prediction period is long.
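A hedged sketch of multi-stage prediction; the one-step predictor is passed in (for instance the AR helpers sketched in Section 3.1.1), and the function name is ours:

def multi_stage_predict(history, step_fn, h):
    """Apply one single-step model repeatedly, feeding each prediction back as an input."""
    window = list(history)
    preds = []
    for _ in range(h):
        nxt = step_fn(window)
        preds.append(nxt)
        window.append(nxt)      # the predicted value becomes an input for the next step
    return preds

# e.g. coef = fit_ar(train, p); multi_stage_predict(train, lambda w: predict_next(w, coef), h)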

3.2.2. Independent Value Prediction aims to

predict the future values separately using multiple

regression models. Given the data set shown in Table 1,

we first create h training sets. All h training sets have

the same input variables X as shown in Table 1, but

have different output variables Y. We use Y(1) as the

output variable for the first training set, Y(2) as the

output variable for the second training set, and so on.

By learning each training set independently, we obtain h regression models f_i, i = 1, 2, ..., h:

Y(i) = f_i(X'),  i = 1, 2, ..., h        (15)

The derived models are then used to predict the next h values in the following way:

x_{t+i} = f_i(X'),  i = 1, 2, ..., h        (16)

where f_1(X') is used to predict x_{t+1}, f_2(X') is used to predict x_{t+2}, and so on. We will illustrate some of the

potential drawbacks of using this approach in Section 4,

in the context of learning difficulty, efficiency, and

smoothness of prediction.
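A corresponding sketch of independent value prediction with one linear least-squares model per prediction step (an illustrative reconstruction, not the authors' implementation):

import numpy as np

def fit_independent(x, p, h):
    """Fit one linear model f_i per prediction step i = 1..h on the sliding-window training set."""
    x = np.asarray(x, dtype=float)
    n = len(x) - p - h + 1
    X = np.array([x[s:s + p] for s in range(n)])           # X': first p values of each window
    Y = np.array([x[s + p:s + p + h] for s in range(n)])   # Y : next h values of each window
    return [np.linalg.lstsq(X, Y[:, i], rcond=None)[0] for i in range(h)]

def independent_predict(history, models, p):
    xp = np.asarray(history[-p:], dtype=float)
    return [float(xp @ m) for m in models]                 # f_i(X') predicts x_{t+i}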

3.2.3. Parameter Prediction transforms the

problem of predicting h output values into an

equivalent problem of predicting (d+1) parameters. To

illustrate this approach, consider the data shown in

Table 1. For each record in Table 1, we fit a parametric

function g to the output vector Y. Let (c0, c1,…, cd)

denote the parameters of the function g. We then

replace the original output vector Y=[Y(1),Y(2),…,Y(h)]

with a modified output vector Y’=[ c0, c1, …, cd]. As a

result, there are (d+1) output columns in Y’. We now

construct (d+1) regression functions difi ,...,2,1,0, = using

the same approach as the independent value prediction

method described in the previous section.

diXfc ii ,...,2,1,0),'( == (17)

Once the models have been constructed, we can apply

them to predict the (d+1) parameters for a test record.

The predicted parameters are then used to reconstruct

the time series values by substituting the parameters

into the parametric function g. Although this

methodology is generally applicable to any family of

parametric functions, we use polynomial functions for

our experiments.

This method is an improvement over the

independent value prediction. Although it still predicts

the parameters independently, the smoothness property

of the time series is ensured by the fitting function g. It

is also more efficient than independent value prediction

when the number of parameters needed to fit the

subsequence is small.
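A sketch of parameter prediction with a polynomial fitting function, as used in the experiments; numpy's polyfit/polyval stand in for the fitting step, and the whole block is an illustrative reconstruction rather than the authors' code:

import numpy as np

def fit_parameter_models(x, p, h, d):
    """Fit degree-d polynomials to each output window, then one linear model per coefficient."""
    x = np.asarray(x, dtype=float)
    n = len(x) - p - h + 1
    X = np.array([x[s:s + p] for s in range(n)])
    Y = np.array([x[s + p:s + p + h] for s in range(n)])
    steps = np.arange(1, h + 1)
    C = np.array([np.polyfit(steps, y, d) for y in Y])     # (d+1) polynomial coefficients per window
    return [np.linalg.lstsq(X, C[:, i], rcond=None)[0] for i in range(d + 1)]

def parameter_predict(history, models, p, h):
    xp = np.asarray(history[-p:], dtype=float)
    coeffs = [float(xp @ m) for m in models]               # predicted coefficients c_0..c_d
    return np.polyval(coeffs, np.arange(1, h + 1))         # reconstruct the h future values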

The models f and f_i described in (14), (16) and

(17) can be used for MLR and RNN. However, the

situation is somewhat different for HMM/MLR. In

HMM/MLR, each state has a corresponding regression

model. As a result, the number of models to be trained

in HMM/MLR is N for multi-stage prediction, N×h for

independent value prediction and N×(d+1) for

parameter prediction, where N is the total number of

states.

3.3. Model Selection

The parameters for our prediction approaches include

the order of regression model p for MLR, the size of

prediction window h, and the degree of polynomial fit

d for parameter prediction. The size of the prediction

window is usually specified by the user based on

domain knowledge or application requirement.

Final prediction error (FPE) is a measure that can

be used to determine the right order for p in the MLR

model. The FPE criterion was introduced by Akaike

[21] to select the appropriate order of an AR model to

fit a time series. The idea is to choose p in such a way

that the single-step mean squared error is minimized

when the model is used to predict an independent

realization, Y, of the same process that generates the

time series X.

FPE = ( (t + p) / (t − p) ) · δ̂²        (18)

where

δ̂² = ( 1 / (t − p − h) ) Σ_{j=1}^{t−p−h} ( y_j − ŷ_j )²        (19)

For parameter prediction, the degree of the polynomial fit d can also be determined in the same manner by representing the input of the fitting function as [1, t, t², ..., t^d], t = 1, 2, ..., h.


To determine the correct order for RNNs, we

employ the method described by Kennel in [7], which

uses an incremental search based on the false nearest

neighbor heuristic to determine a minimal value for p.

Let X_p denote an instance of the training data and X_p^{(n)} denote its corresponding nearest neighbor. The pair is declared as false nearest neighbors if

[ d( X_{p+1}, X_{p+1}^{(n)} ) − d( X_p, X_p^{(n)} ) ] / d( X_p, X_p^{(n)} )

exceeds an absolute value, which we have chosen to be equal to 10. In this formula, d refers to the distance between a pair of observations. Our objective for model selection is to choose a minimum value for p such that the number of false nearest neighbors is close to zero.

4. Experiments and Discussions

In this section, we analyze the strengths and

weaknesses of the three prediction approaches using

both real and synthetic datasets. The real datasets are

obtained from the UCI Machine Learning Repository

[22] and the Time Series Data Library [14].

Experiments are conducted on a Pentium 4 machine

with 3GHz CPU and 1GB of RAM.

4.1. Evaluation Metric

The estimation error of a prediction approach is

evaluated based on the RMSE measure:

RMSE = sqrt( Σ_{i=1}^{h} ( y_i − ŷ_i )² / Σ_{i=1}^{h} ( y_i − ȳ )² )        (20)

where ŷ_i is the predicted value of y_i and ȳ is the average value of the time series. Intuitively, RMSE determines how well the prediction approach compares to making a guess based on the average time series value. The RMSE values recorded in our experimental results are obtained using ten-fold cross validation.
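A one-function sketch of this normalized RMSE (equation 20), under the reading that both sums run over the h predicted steps; the function name is ours:

import numpy as np

def normalized_rmse(y_true, y_pred, series_mean):
    """RMSE of the prediction relative to always guessing the mean of the series."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.sum((y_true - y_pred) ** 2) / np.sum((y_true - series_mean) ** 2)))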

A Win-Draw-Loss Table can also be created to

compare the relative performance between two

prediction approaches when applied to n data sets. We

use the criterion of 0.01 difference in RMSE to

determine whether one approach wins or loses against

another approach. For a stricter evaluation, we apply

the paired t significance test to determine whether the

observed difference in RMSE is statistically significant.

To do this, we first calculate the difference (d) in the

RMSE obtained from two prediction approaches on

each data set. The mean d and standard deviation ds of

the observed differences are also calculated. To

determine whether the difference is significant, we

compute their T-statistic:

ns

dt

d /= , (21)

which follows a t-distribution with n-1 degrees of

freedom. Under the null hypothesis that the two

prediction approaches are comparable in performance,

we expect the value of t should be close to zero. From

the computed value for t, we estimate the p-value of the

difference, which corresponds to the probability of

rejecting the null hypothesis. We say the difference in

performance is statistically significant if p<0.05 and

highly statistically significant if p < 0.001.

4.2. Error Accumulation

The purpose of this experiment is to study the

effect of error accumulation when making a sequence

of predictions using the approaches described in

Section 3.2. We define error accumulation as the

propagation of prediction errors from the past into

future predictions. To gain a better insight into the

error accumulation problem, we employ the bias-

variance decomposition for squared loss functions [24].

Consider a time series generated by the model x_{t+1} = f( X_{t−p}^{t} ) + e(0, σ²). Let (y_1, y_2, ..., y_h) denote the true values of the time series for a prediction window of length h, i.e., y_1 = x_{t+1}, y_2 = x_{t+2}, ..., y_h = x_{t+h}. Furthermore, let (y_1*, y_2*, ..., y_h*) be the corresponding values generated by the true model f. In other words, y_i = y_i* + e(0, σ²). We use the notation (v_1, v_2, ..., v_h) to denote the values predicted by a regression model, g.

We define the mean squared error (MSE) at each

prediction step j as follows:

MSE(j) = E[ ( y_j − v_j )² ]        (22)

The MSE at each step can be further decomposed into

the following three components [23]:

\[ MSE(j) = \underbrace{\big(E[v_j] - y_j^*\big)^2}_{\text{bias}^2}
          + \underbrace{E\big[(v_j - E[v_j])^2\big]}_{\text{variance}}
          + \underbrace{E\big[(y_j - y_j^*)^2\big]}_{\text{noise}} \qquad (23) \]

The first term represents the squared bias (or simply,

bias) of the model, the second term represents the

variance of the model, while the third term corresponds
to the inherent variability due to noise, \(\sigma_j^2\). We


occasionally use the notation \(\bar{v}_j\) to denote \(E(v_j)\). The

next example illustrates the propagation of errors due

to the noise term.

Example 1: Consider the following AR(2) model:
\( x_{t+1} = a_1 x_t + a_2 x_{t-1} + \varepsilon \), where \(\varepsilon\) has mean zero and
variance \(\sigma^2\). Suppose we employ the MLR regression

method to learn the dynamics of the time series.

Assume that the number of training examples is large

enough to enable the MLR to accurately estimate the

coefficients a1 and a2. Furthermore, we ignore the bias

and variance of the model. For the multi-stage

approach:

\begin{align*}
(1)\quad & y_1 = a_1 x_t + a_2 x_{t-1} + \varepsilon_1, \qquad v_1 = a_1 x_t + a_2 x_{t-1} \\
& \therefore\; MSE(1) = E\big[(y_1 - v_1)^2\big] = \sigma^2. \\
(2)\quad & y_2 = a_1 y_1 + a_2 x_t + \varepsilon_2
           = a_1 (a_1 x_t + a_2 x_{t-1} + \varepsilon_1) + a_2 x_t + \varepsilon_2,
  \qquad v_2 = a_1 v_1 + a_2 x_t \\
& \therefore\; MSE(2) \;\propto\; (1 + a_1^2)\,\sigma^2. \\
(3)\quad & y_3 = a_1 y_2 + a_2 y_1 + \varepsilon_3, \qquad v_3 = a_1 v_2 + a_2 v_1,
  \qquad y_3 - v_3 = (a_1^2 + a_2)\,\varepsilon_1 + a_1 \varepsilon_2 + \varepsilon_3 \\
& \therefore\; MSE(3) \;\propto\; \big(1 + a_1^2 + (a_1^2 + a_2)^2\big)\,\sigma^2.
\end{align*}

The above formula shows how the errors due to noise

grow for multi-stage prediction as the prediction step

increases. For independent value prediction, it can be

shown that:

\begin{align*}
y_1 &= a_1 x_t + a_2 x_{t-1} + \varepsilon_1 \\
y_2 &= a_1 y_1 + a_2 x_t + \varepsilon_2
     = (a_1^2 + a_2)\, x_t + a_1 a_2\, x_{t-1} + a_1 \varepsilon_1 + \varepsilon_2 \\
y_3 &= a_1 y_2 + a_2 y_1 + \varepsilon_3
     = (a_1^3 + 2 a_1 a_2)\, x_t + (a_1^2 a_2 + a_2^2)\, x_{t-1}
       + (a_1^2 + a_2)\,\varepsilon_1 + a_1 \varepsilon_2 + \varepsilon_3
\end{align*}

If the coefficients for \(x_t\) and \(x_{t-1}\) can be estimated

accurately by the MLR for all the prediction steps, then

the MSE for multi-stage and independent value

prediction approaches become identical. Similarly, for

parameter prediction, if the number of parameters is

the same as the size of the prediction window, then

both independent value and parameter prediction

approaches would be equivalent and we obtain the

same MSE as multi-stage prediction.

The preceding example illustrates how errors due

to the noise term are propagated in the time series.

Such errors are unavoidable, irrespective of the

prediction approach (multi-stage, independent value or

parameter prediction).
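To make the distinction between the approaches concrete, the following sketch contrasts multi-stage and independent value prediction with an ordinary least-squares regressor (illustrative code; the data, window sizes and helper names are assumptions, not the authors' setup):

```python
import numpy as np

def make_windows(x, p, h):
    """Rows of X: p past values; rows of Y: the following h values."""
    X = np.array([x[t:t + p] for t in range(len(x) - p - h + 1)])
    Y = np.array([x[t + p:t + p + h] for t in range(len(x) - p - h + 1)])
    return X, Y

def fit_linear(X, y):
    """Least-squares fit with an intercept term."""
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_linear(w, x):
    return float(np.append(x, 1.0) @ w)

rng = np.random.default_rng(1)
x = np.sin(0.2 * np.arange(600)) + 0.1 * rng.standard_normal(600)
p, h = 4, 10
X, Y = make_windows(x[:500], p, h)

# Multi-stage: one one-step-ahead model, applied recursively; predictions are fed back.
w1 = fit_linear(X, Y[:, 0])
window = list(x[500 - p:500])
multi_stage = []
for _ in range(h):
    nxt = predict_linear(w1, window[-p:])
    multi_stage.append(nxt)
    window.append(nxt)

# Independent value: h separate models, each predicting step k directly from the p inputs.
independent = [predict_linear(fit_linear(X, Y[:, k]), x[500 - p:500]) for k in range(h)]

print(np.round(multi_stage, 3))
print(np.round(independent, 3))
```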

We next consider the propagation of errors due to

the bias and variance of a model. To do this, we

generate the following time series:

\( X_t = 0.418\, X_{t-1} + 0.634\, X_{t-2} + \varepsilon \qquad (24) \)

where ε is Gaussian noise with mean zero and
variance \(\sigma^2 = 0.1\). We set the length of the time series to
1000 and the prediction window to h = 50. Furthermore,
to ensure there is bias in the model, we set p = 1. We
use the bootstrap approach to measure the bias and
variance of the induced models. More specifically,
given an initial training set D (as shown in Table 1), we
perform sampling with replacement to obtain a new
training set D'. A regression model g is then induced
from the bootstrap replicate. This procedure is repeated
500 times to obtain an ensemble of 500 models. We
then apply the models to the test sequence to obtain
500 estimated values \(v_j\) for each prediction step j. To
compute the empirical bias, we take the average of the
500 predictions, \(\bar{v}_j\), and subtract it from the value
predicted using the ground truth model (Equation 24
without the noise term). The variance of the models is
estimated as follows:

\[ \mathrm{var}(j) = E\big[(v_j - \bar{v}_j)^2\big] \]
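A sketch of this bootstrap procedure is given below (illustrative NumPy code; the ensemble size of 500 and the multi-stage use of a one-step model follow the text, while the AR(2) coefficients and all names are our own assumptions rather than those of Equation (24)):

```python
import numpy as np

rng = np.random.default_rng(2)

def ar2(n, a1=0.5, a2=0.3, sigma=0.3):
    """A synthetic AR(2) series (coefficients are illustrative only)."""
    x = np.zeros(n)
    for t in range(2, n):
        x[t] = a1 * x[t - 1] + a2 * x[t - 2] + rng.normal(0.0, sigma)
    return x

def fit(X, y):
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

x = ar2(1000)
p, h, B = 1, 50, 500                       # p = 1 deliberately under-specified (bias)
X = np.array([x[t:t + p] for t in range(900 - p - h)])
Y = np.array([x[t + p:t + p + h] for t in range(900 - p - h)])
x_last = list(x[900 - p:900])              # last p observations before the test window
preds = np.zeros((B, h))

for b in range(B):                         # bootstrap replicates of the training set
    idx = rng.integers(0, len(X), len(X))
    w1 = fit(X[idx], Y[idx][:, 0])         # one-step model, applied multi-stage
    window = list(x_last)
    for j in range(h):
        nxt = float(np.append(window[-p:], 1.0) @ w1)
        preds[b, j] = nxt
        window.append(nxt)

v_bar = preds.mean(axis=0)                       # average prediction at each step
variance = ((preds - v_bar) ** 2).mean(axis=0)   # var(j); the empirical bias would
print(np.round(variance[:10], 4))                # additionally use the noise-free model
```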

Figure 4 shows the bias and variance of the models

for the multi-stage and independent value prediction

approaches using the synthetic data set generated by

Equation (24). The figure clearly shows that the bias
and variance for multi-stage prediction grow steadily
with increasing time steps, unlike the independent value

prediction approach. The bias and variance of the latter

approach are considerably smaller and do not appear to

be propagated into future predictions.

Figure 4: Bias and Variance for MLR


Figure 5: Bias and Variance for HMM/MLR

A similar conclusion can be reached using HMM/MLR

as the underlying regression method. The results for

HMM/MLR are shown in Figure 5. In section 3.1.3, we
have shown that the prediction of an HMM/MLR
model is given by \( X_{t+1} = \sum_{i=1}^{N} Q_{t+1}(i)\, f_{s_i}\big(X_{t-p+1}^{t}\big) \). When

multi-stage prediction is used, the argument for f

includes some of the earlier predicted values for X, thus

leading to the bias and variance accumulation problem.

Note that for some data sets, the bias and variance

curves for independent value prediction may behave

quite erratically. For example, Figure 6 shows the bias

and variance for RNN when trained on a time series

generated according to the following recursive
formula: \( x_t = \sin(\pi x_{t-1}/2) + \varepsilon(0, 0.1) \).

The erratic behavior arises because the regression

method has not learned the correct parameters of the

complex models.

Figure 6: Bias and Variance for RNNs

In short, our study shows that error accumulation (in

bias and variance) is a major problem in multi-stage

prediction. This problem occurs irrespective of the

choice of regression method used to fit the data.

4.3. Learning Difficulty

For multi-stage prediction, we only need to build a

single regression model to fit the entire time series. On

the other hand, we need to build h regression models

for independent value prediction and (d+1) regression

models for parameter prediction. Model building is

therefore less expensive for the multi-stage approach

compared to the independent value and parameter

prediction approaches.

Even if the true model is simple, the function to be

learnt by independent value prediction becomes

increasingly complex with increasing time step. We

illustrate this problem by considering a prediction task

that uses p past values to predict h future values. Let f

denote the true model that generates the data, i.e.,
\( X_t = f(X_{t-p}^{t-1}) \). For simplicity, let \( (x_1, x_2, \ldots, x_p) \)
denote the predictor variables and \( y_i = x_{p+i} \) denote the h

output variables:

\begin{align*}
y_1 &= f(x_1, x_2, \ldots, x_p) = f_1(x_1, x_2, \ldots, x_p) \\
y_2 &= f(x_2, \ldots, x_p, y_1) = f\big(x_2, \ldots, x_p, f_1(x_1, \ldots, x_p)\big) = f_2(x_1, x_2, \ldots, x_p) \\
    &\;\;\vdots \\
y_h &= f(y_{h-p}, \ldots, y_{h-1}) = f\big(f_{h-p}(x_1, \ldots, x_p), \ldots, f_{h-1}(x_1, \ldots, x_p)\big) = f_h(x_1, x_2, \ldots, x_p)
\end{align*}

If f is a linear function, it can be shown that all the

fk’s (k=1,2,…,h) constructed by the independent

prediction method are also linear functions. However,


if f is non-linear, then the fk’s become increasingly

complex functions of the original predictors (x1, x2, …,

xp). In other words, unless the regression model is very

flexible, learning the appropriate model for each time

step can be a very challenging task.

For parameter prediction, the learning difficulty

depends on how well the parametric function fits the

output vector Y. If the function is very flexible (i.e.,

requires a large number of parameters), then we need a

wide enough prediction window to allow the

parameters of the function to be accurately estimated.

Otherwise, the estimated parameters may overfit the

training data, which in turn, leads to poor prediction

results. If we use polynomial functions to fit the output

vector and the number of parameters is the same as the

number of time steps to be predicted, then parameter

prediction is equivalent to independent value prediction.

If the parametric function is too simple, then the

prediction result may still be poor because the function

may underfit the training data.
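As a concrete illustration of this trade-off, the sketch below fits a degree-d polynomial to each training output window, learns one regression model per polynomial coefficient, and reconstructs the predicted window from the predicted coefficients (illustrative code; the degree, window sizes and data are assumptions):

```python
import numpy as np

def fit_linear(X, y):
    A = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def parameter_prediction(x, p=12, h=12, d=4):
    """Represent each length-h output window by the d+1 coefficients of a
    degree-d polynomial and predict those coefficients instead of the h values."""
    X = np.array([x[t:t + p] for t in range(len(x) - p - h)])
    Y = np.array([x[t + p:t + p + h] for t in range(len(x) - p - h)])
    steps = np.arange(h)
    coeffs = np.array([np.polyfit(steps, y, d) for y in Y])      # d+1 parameters per window
    models = [fit_linear(X, coeffs[:, k]) for k in range(d + 1)]  # d+1 models instead of h
    last = np.append(x[-p:], 1.0)
    pred_coeffs = np.array([last @ m for m in models])
    return np.polyval(pred_coeffs, steps)     # reconstruct the h predicted values

rng = np.random.default_rng(3)
x = np.sin(2 * np.pi * np.arange(300) / 12) + 0.1 * rng.standard_normal(300)
print(np.round(parameter_prediction(x), 3))
```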

To compare parameter prediction against

independent value prediction, we apply both methods

to the monthly milk production data (see the top
diagram in Figure 7) to predict the next twelve months of
production (h=12) using its past monthly values

(p=12). For the parameter prediction approach, we use

a polynomial function to fit the output vector and vary

the degree of the polynomial from 0 to 11. We then

employ MLR to predict the parameters. The bottom

diagram of Figure 7 shows a comparison between the

RMSE of parameter prediction against independent

prediction as the degree of the polynomial function is

varied. Observe that the RMSE for parameter

prediction drops dramatically after the first three

iterations and decreases slowly afterwards. This result

suggests that it is sufficient to build 5 regression

models to fit a polynomial of degree 4 to the output

vector and achieves quite comparable predictive

accuracy as independent value prediction (which must

build 12 independent regression models).

Figure 7. Prediction Results (p=12, h=12)

4.4. Smoothness of Prediction

Another effect to consider is the influence of

noise on the prediction methods. To do this, we

conduct an experiment using a simple, stationary time
series, i.e., white noise, as shown in Figure 8. As shown

in Figure 9, multi-stage prediction tends to smooth out

the prediction to the mean value of the noise time series

after a certain period of time (specifically, the order p).

Such a smoothing effect is not present in independent

value prediction, which predicts spurious values

fluctuating around the mean, because the prediction at

each time step is made independently. In fact, this

method may suffer from overfitting as it tries to capture

the fluctuations in the noise time series. For parameter

prediction, the best fit model of the data is a

polynomial of degree zero. Even though the parameters

are predicted independently, the smoothness of the time

series is guaranteed by the parametric function used to

fit the output vector.

Figure 8. White Noise WN(0,0.5)

Figure 9. Prediction Results (p=12, h=100, d=6)

4.5. A General Comparison

Finally, we apply the three prediction approaches

to 21 real data sets to compare their relative

performance. The RMSE value for each data set is


obtained by 10-fold cross validation. The size of the

prediction window is set to h=24. The model

parameters for MLR and RNNs are selected using the

methods described in section 3.3.

Table 2 shows the RMSE for the three prediction

approaches using MLR as the underlying regression

method. Their relative performance is summarized in

Table 3 in terms of the number of wins, draws and

losses. We also test the significance of the difference

using paired t-significance test. As the result shows, the

observed difference between the RMSE of multi-stage
and independent value prediction is not statistically significant.

However, the performance of parameter prediction is

significantly worse than independent value prediction.

This is because MLR may not be suitable to fit the

parameters of the function, which have nonlinear

relationships with the time series values.

Table 2. Multiple Linear Regression

Data set    Multi-stage  Independent  Parameter
milk        0.0733       0.0705       0.0918
Temp.       0.2776       0.2959       0.2936
PET         0.0419       0.0414       0.0619
PREC        0.0310       0.0317       0.0534
Solar       0.1218       0.1236       0.2132
appb        0.2974       0.3804       0.3803
appd        0.2152       0.2395       0.2766
appf        0.3147       0.2445       0.2991
appg        0.8642       0.9343       0.9218
deaths      0.7309       0.5560       0.5633
lead        0.4195       0.4207       0.4206
sales       0.3187       0.3637       0.3637
wine        0.2738       0.2902       0.3209
seriesc     0.9359       0.9845       0.9845
odonovan    0.4226       0.4731       0.4712
qbirth      0.5450       0.4793       0.5231
Bond2       0.5226       0.5886       0.5884
Dailysap    0.2006       0.2137       0.2137
food        0.1995       0.1929       0.1950
treering    0.8929       0.8807       0.8818
pork        0.9462       0.7948       0.7918

Table 3. Win-Draw-Loss & Paired t-test for MLR

           Multi-stage vs Independent   Multi-stage vs Parameter   Independent vs Parameter
0.01 diff  10-6-5                       14-2-5                     8-12-1
t value    0.1496                       -0.8761                    -2.7299
P value    0.8826                       0.3914                     0.0129

Table 4 shows the RMSE for the three prediction

approaches using RNNs. The results are summarized in

Table 5 in terms of the number of wins-draws-losses

and paired t-significance test. The results with RNNs

suggest that the independent value and parameter

prediction approaches perform significantly better than

multi-stage prediction at p < 0.05. For multi-stage

prediction, the RMSE for RNNs is higher than the

RMSE of MLR in 10 out of 21 data sets. This result

suggests the possibility of model overfitting in some of

these data sets when using a flexible regression method

such as RNNs. Nevertheless, we can still find 17 data

sets in which independent prediction with RNNs

outperforms all the prediction approaches using MLR

and 12 data sets in which parameter prediction with

RNNs outperforms all the prediction approaches using

MLR. This result suggests that for nonlinear regression

methods such as RNNs, the independent value and

parameter prediction can achieve better performance

than multi-stage prediction. Moreover, for parameter

prediction, most of the data sets require d < 5 (using
the methodology described in Section 4.3), which

demonstrates the efficiency of parameter prediction

compared to independent value prediction (which

requires building h = 24 models).

Table 4. RNNs

Data set    Multi-stage  Independent  Parameter
milk        0.1453       0.1075       0.1146
Temp.       0.3820       0.2215       0.3073
PET         0.0572       0.0308       0.0842
PREC        0.0869       0.0301       0.0527
Solar       0.1804       0.1626       0.1722
appb        0.2932       0.1915       0.1565
appd        0.3152       0.2892       0.2976
appf        0.3221       0.2100       0.2570
appg        0.8959       0.7006       0.8178
deaths      0.8130       0.6762       0.7723
lead        0.3958       0.4190       0.4032
sales       0.2357       0.2732       0.2569
wine        0.3172       0.2731       0.2876
seriesc     0.8332       0.9165       0.7001
odonovan    0.3738       0.4282       0.2539
qbirth      0.5425       0.4718       0.5513
Bond2       0.4056       0.4663       0.4355
Dailysap    0.1988       0.1825       0.1813
food        0.1936       0.1782       0.1862
treering    0.8486       0.8248       0.8427
pork        0.8688       0.7517       0.7828


Table 5. Win-Draw-Loss & Paired t-test for RNNs

           Multi-stage vs Independent   Multi-stage vs Parameter   Independent vs Parameter
0.01 diff  7-0-14                       5-2-14                     13-1-7
t value    2.6396                       3.3884                     -0.3012
P value    0.0157                       0.0029                     0.7664

5. Conclusions and Future Work

In this paper, we conduct an empirical study on

three prediction approaches for solving the multistep-

ahead time series prediction problem. The advantages

and disadvantages of these approaches are studied

using real and synthetic data sets. Using the bias-

variance decomposition framework, our experimental

results show that multi-stage prediction tends to suffer

from the error accumulation problem especially when

the prediction window is long. This is because the bias

and variance in previous time steps are propagated into

future predictions. Independent value prediction is less

susceptible to this problem because its predictions are

made independently at each time step. However, it has

difficulty in learning the appropriate function when the

prediction window is large because the true function

becomes more complex with increasing prediction time

steps. This approach also does not smooth out the

effect of noise unlike multi-stage prediction. Parameter

prediction smooths the effect of noise by fitting a

function over the entire output sequence and avoids the

error accumulation problem by making independent

predictions. It also tends to be more efficient than

independent value prediction when the parameter set is

small. However, finding the appropriate parameter

function to fit the output values can still be quite a

challenging task. We observe successful applications of

both independent value and parameter prediction

approaches when applied to real data sets using RNNs.

For future work, we aim to investigate how the

prediction approaches depend on other time series

properties such as temporal auto-correlation and

develop theoretical analyses to explain some of the

findings observed in this paper.

6. References

[1] Gershenfeld N. A. and Weigend A. S., “The Future of

Time Series.” In “Time Series Prediction: Forecasting

the Future and Understanding the Past”, pp 1-70

(1993).

[2] Jones R. H. “Maximum likelihood fitting of ARMA

models to time series with missing observations.”

Technometrics 20, pp.389–395 (1980).

[3] Harvey A. C. and McKenzie C. R. “Algorithm AS182.

An algorithm for finite sample prediction from ARIMA

processes”. Applied Statistics 31, 180–187 (1982).

[4] Giles C.L., Lawrence S. and Tsoi A.C., “Noisy Time

Series Prediction using a Recurrent Neural Network

and Grammatical Inference,” Machine Learning, 44(1-

2), pp.161-183.(2001).

[5] Benjamin Kedem and Konstantinos Fokianos

“Regression Models for Time Series Analysis”, Wiley

(2002).

[6] Elman J.L. “Distributed Representations, Simple

Recurrent Networks, and Grammatical Structure.”

Machine Learning, 7 (2/3), pp.195–226 (1991).

[7] Kennel M. B., Brown R., and Abarbanel H. D. I.,

“Determining embedding dimension for phase-space

reconstruction using a geometrical construction”, Phys.

Rev. A 45, 3403 (1992).

[8] Thomas P. Minka, “Bayesian linear regression”,
MIT Media Lab note (2001), available at
http://research.microsoft.com/~minka/papers/linear.html

[9] Cleeremans A., Servan-Schreiber D., and McClelland.

J.L. “Finite state automata and simple recurrent

networks.” Neural Computation, 1(3), pp.372–381

(1989).

[10] C. Lee Giles, Miller C.B., Chen D., Chen H.H., Sun

G.Z. and Lee Y.C. “Learning and extracting finite state

automata with second-order recurrent neural networks.”

Neural Computation, 4(3), pp.393–405 (1992).

[11] Harvey A. C., “Time series models”, Philip Allan

(1981).

[12] Vandaele W., “Applied time series and Box-Jenkins

models”, Academic Press (1983).

[13] Box G. E. P., Jenkins G. M., “Time series analysis:
forecasting and control”, Holden-Day (1976).

[14] Hyndman R., Time Series Data Library,
http://www-personal.buseco.monash.edu.au/~hyndman/TSDL/

[15] Edwards T., Tansley D.S.W., Davey N. and Frank R.J.

“Traffic Trends Analysis using Neural Networks.” Proc.

of the International Workshop on Applications of

Neural Networks to Telecommunications 3, pp157-164

(1997).


[16] Joseph Rynkiewicz, “Hybrid HMM/MLP models for
time series prediction.” Proc. of the European
Symposium on Artificial Neural Networks, Bruges,
Belgium, pp. 455-462 (1999).

[17] Chen Rong, Yang Lijian and Hafner Christian,

“Nonparametric multistep-ahead prediction in time

series analysis” Journal of the Royal Statistical Society

Series B, 66 (3), pp 669 (2004).

[18] Elliott, R., Aggoun, L., Moore J. “Hidden Markov

models : estimation and control”, Springer (1997).

[19] Rabiner L.R. “A tutorial on hidden Markov models and
selected applications in speech recognition.” Proc. of the
IEEE, 77(2), pp.257-286 (1989).

[20] Hartley H. “Maximum likelihood estimation from

incomplete data.” Biometrics, 14:174–194. (1958)

[21] Akaike H. “Fitting autoregressive models for

prediction.” Annals of the Institute of Statistical

Mathematics, 21, 243 – 247.(1969).

[22] UCI Machine Learning Repository

http://www.ics.uci.edu/~mlearn/MLRepository.html

[23] Y. Le Borgne, “Bias-variance trade-off characterization
in a classification. What differences with regression?”,
Technical Report No. 534, ULB, January (2005).

[24] Geman S., Bienenstock E. and Doursat R., “Neural

networks and the bias/variance dilemma”, Neural

Computations, 4, 1-48(1992).


A PCA-based Kernel for Kernel PCA on Multivariate Time Series

Kiyoung Yang and Cyrus Shahabi
Computer Science Department
University of Southern California
Los Angeles, CA 90089-0781
[kiyoungy,shahabi]@usc.edu

Abstract

Multivariate time series (MTS) data sets are common in various multimedia, medical and financial application domains. These applications perform several data-analysis operations on a large number of MTS data sets, such as similarity searches, feature-subset-selection, clustering and classification. Inherently, an MTS item has a large number of dimensions. Hence, before applying data mining techniques, some form of dimension reduction, e.g., feature extraction, should be performed. Principal Component Analysis (PCA) is one of the techniques that have been frequently utilized for dimension reduction. However, traditional PCA does not scale well in terms of dimensionality, and therefore may not be applied to MTS data sets. The Kernel PCA technique addresses this problem of scalability by utilizing the kernel trick. In this paper, we propose a PCA based kernel to be employed for the Kernel PCA technique on MTS data sets, termed KEros, which is based on Eros, a PCA based similarity measure for MTS data sets. We evaluate the performance of KEros using Support Vector Machine (SVM), and compare the performance with Kernel PCA using linear kernel and Generalized Principal Component Analysis (GPCA). The experimental results show that KEros outperforms these other techniques in terms of classification accuracy.

1 INTRODUCTION

A time series is a series of observations, x_i(t); [i = 1, · · · , n; t = 1, · · · , m], made sequentially through time, where i indexes the measurements made at each time point t [20]. It is called a univariate time series (UTS) when n is equal to 1, and a multivariate time series (MTS) when n is equal to, or greater than, 2. A UTS item is usually represented in a vector of size m, while each MTS item is typically stored in an m × n matrix, where m is the number of observations and n is the number of variables (e.g., sensors).

MTS data sets are common in various fields, such as in multimedia, medicine and finance. For example, in multimedia, Cybergloves used in Human and Computer Interface (HCI) applications have around 20 sensors, each of which generates 50∼100 values in a second [11, 19]. For gesture recognition and video sequence matching using computer vision, several features are extracted from each image continuously, which renders them MTSs [5, 2, 17]. In medicine, Electro Encephalogram (EEG) signals from 64 electrodes placed on the scalp are measured to examine the correlation of genetic predisposition to alcoholism [26]. Functional Magnetic Resonance Imaging (fMRI) from 696 voxels out of 4391 has been used to detect similarities in activation between voxels in [7].

An MTS item is typically very high dimensional. Forexample, an MTS item from one of the data sets used in theexperiments in Section 4 contains 3000 observations with64 variables. If a traditional distance metric for similaritysearch, e.g., Euclidean Distance, is to be utilized, this MTSitem would be considered as a 192000 (3000 × 64) dimen-sional data. 192000 dimensional data would be overwhelm-ing not only for the distance metric, but also for indexingtechniques. To the best of our knowledge, there has been noattempt to index data sets with more than 100000 dimen-sions/features1. Hence, it would be necessary to preprocessthe MTS data sets and reduce the dimension of each MTSitem before performing any data mining tasks.

A popular method for dimension reduction is PrincipalComponent Analysis (PCA) [10]. Intuitively, PCA firstfinds the direction where the variance is maximized andthen projects the data on to that direction. However, tra-ditional PCA cannot be applied to MTS data sets as is,since each MTS item is represented in a matrix, while forPCA each item should be represented as a vector. Thougheach MTS item may be vectorized by concatenating its UTSitems (i.e., the columns of MTS), this would result in the

1In [3], the authors employed MVP-tree to index 65536 dimensionalgray-level MRI images.


loss of the correlation information among the UTS items.In order to overcome this limitation of PCA, i.e., the datashould be in the form of a vector, Generalized PrincipalComponent Analysis (GPCA) is proposed in [24]. GPCAworks on the matrices and reduces the number of rows andcolumns simultaneously by projecting a matrix into a vectorspace that is the tensor product of two lower dimensionalvector spaces. Hence the data do not need to be vectorized.While GPCA reduces the dimension of an MTS item, the re-duced form is still a matrix. In order to be fed into such datamining techniques as Support Vector Machine (SVM) [21],the reduced form still needs to be vectorized, which wouldbe a whole new challenge.

Moreover, for traditional PCA, even if each MTS itemis vectorized, the space complexity of PCA would be over-whelming. For example, assume an MTS item with 3000observations and 64 variables. If this MTS item is vector-ized by concatenating each column end to end, the lengthof the vector would be 192000. Assume that there are 378items in the data set. Then the whole data set would berepresented in a matrix of size 378 × 192000. Though, intheory, the space complexity of PCA is O(nN) where n isthe number of features/dimensions and N is the number ofitems in the data set [14], Matlab fails to perform PCA ona matrix of size 378 × 192000 on a machine with 3GB ofmain memory due to lack of memory. In [8], it has alsobeen observed that PCA does not scale well as a function ofdimensionality.

In [18], the authors proposed an extension of PCA usingkernel methods, termed Kernel PCA. Intuitively, what theKernel PCA does is firstly to transform the data into a highdimensional feature space using a possibly non-linear ker-nel function, and then to perform PCA in the high dimen-sional feature space. In effect, Kernel PCA computes thepair-wise distance/similarity matrix of size N × N using akernel function, where N is the number of items. This ma-trix is called a Kernel Matrix. Kernel PCA therefore scaleswell in terms of dimensionality of the data, since the ker-nel function can be efficiently computed using the KernelTrick [1]. Depending on the kernel employed, Kernel PCAhas been shown to yield better classification accuracy us-ing the principal components in feature space than in inputspace. However, it is an open question which kernel is to beused [18].

In this paper, we propose to utilize Eros [22] for theKernel PCA technique, termed KEros. By using Eros, thedata need not be transformed into a vector for computingthe similarity between two MTSs, which enables us to cap-ture the correlation information among the UTSs in an MTSitem. In addition, by using the kernel technique, the scala-bility problem in terms of dimensionality is resolved. KErosfirstly computes a matrix that contains pair-wise similari-ties between MTSs using Eros, and utilizes this matrix as

Symbol   Definition
A        an m × n matrix representing an MTS item
A^T      the transpose of A
M_A      the covariance matrix of size n × n for A
V_A      the right eigenvector matrix of size n × n for M_A,
         V_A = [a_1, a_2, · · · , a_n]
Σ_A      an n × n diagonal matrix that has all the eigenvalues for M_A obtained by SVD
a_i      a column orthonormal eigenvector of size n for V_A
a_ij     jth value of a_i, i.e., a value at the ith column and the jth row of A
a_*j     all the values at the jth row of A
w        a weight vector of size n, with Σ_{i=1}^{n} w_i = 1 and ∀i w_i ≥ 0

Table 1. Notations used in this paper

the Kernel Matrix for the kernel PCA technique. ThoughEros cannot readily be formulated in terms of dot product asother kernel functions, it has been shown that any positivesemi-definite (PSD) matrices whose eigenvalues are non-negative can be utilized for kernel techniques [13]. If thematrix obtained by using Eros is shown to be positive semi-definite, it can be utilized as a Kernel Matrix as is; other-wise, the matrix can be transformed into a Kernel Matrix byutilizing one of the techniques proposed in [16]. In this pa-per, we utilize the first naıve approach. We plan to comparedifferent approaches to transforming a non-PSD matrix intoa PSD matrix for Eros in the future.

In order to evaluate the effectiveness of the proposed ap-proach, we conducted experiments on two real-world datasets: AUSLAN [11] obtained from UCI KDD repository [9]and the Brain Computer Interface (BCI) data set [12]. Afterperforming dimension reduction using KEros, we comparedthe classification accuracy with other techniques, such asGeneralized Principal Component Analysis (GPCA) [24],and Kernel PCA using linear kernel. The experimental re-sults show that KEros outperforms other techniques by upto 60% in terms of classification accuracy.

The remainder of this paper is organized as follows. Sec-tion 2 discusses the background of our proposed approach.Our proposed approach is presented in Section 3, which isfollowed by the experiments and results in Section 4. Con-clusions and future work are presented in Section 5.

2 BACKGROUND

Our proposed approach is based on Principal ComponentAnalysis (PCA) and our similarity measure for MultivariateTime Series (MTS), termed Eros. In this section, we brieflydescribe PCA and Eros. For details, please refer to [10, 22].For notations used in the remainder of this paper, pleaserefer to Table 1.


[Figure: two principal components of a two-variable data set; PC1 = (cos α1)x1 + (cos β1)x2, with scores obtained by orthogonal projection onto PC1 and PC2.]

Figure 1. Two principal components obtained for one multivariate data set with two variables x1 and x2 measured on 30 observations.

2.1 Principal Component Analysis

Principal Component Analysis (PCA) has been widelyused for multivariate data analysis and dimension reduc-tion [10]. Intuitively, PCA is a process to identify the direc-tions, i.e., principal components (PCs), where the variancesof scores (orthogonal projections of data points onto the di-rections) are maximized and the residual errors are mini-mized assuming the least square distance. These directions,in non-increasing order, explain the variations underlyingoriginal data points; the first principal component describesthe maximum variation, the subsequent direction explainsthe next maximum variance and so on.

Figure 1 illustrates principal components obtained on avery simple (though unrealistic) multivariate data with onlytwo variables (x1, x2) measured on 30 observations. Geo-metrically, the principal component is a linear transforma-tion of original variables and the coefficients defining thistransformation are called loadings. For example, the firstprincipal component (PC1) in Figure 1 can be described asa linear combination of original variables x1 and x2, andthe two coefficients (loadings) defining PC1 are the cosinesof the angles between PC1 and variables x1 and x2, respec-tively. The loadings are thus interpreted as the contributionsor weights on determining the directions.

The central idea of principal component analysis (PCA)is to reduce the dimensionality of a data set consisting ofa large number of interrelated variables, while retaining asmuch as possible the variation present in the data set [10].This is achieved by transforming to a new set of variables,the principal components (PCs), which are uncorrelated,and which are ordered so that the first few retain most ofthe variation present in all of the original variables.

In practice, PCA is performed by applying Singular Value Decomposition (SVD) to either a covariance matrix or a correlation matrix of an MTS item depending on the data set. That is, when a covariance matrix A is decomposed by SVD, i.e., A = UΛU^T, the matrix U contains the variables' loadings for the principal components, and the matrix Λ has the corresponding variances along the diagonal [10].
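A minimal sketch of this step (NumPy; the MTS item here is random and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 22))      # one MTS item: m observations x n variables

M = np.cov(A, rowvar=False)            # n x n covariance matrix M_A
U, lam, _ = np.linalg.svd(M)           # M = U diag(lam) U^T for a symmetric PSD matrix
# Columns of U are the principal-component loadings (right eigenvectors V_A);
# lam holds the corresponding variances (eigenvalues), in non-increasing order.
print(U.shape, np.round(lam[:5], 3))
```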

2.2 Eros

In [22], we proposed Eros as a similarity measure formultivariate time series. Intuitively, Eros computes the sim-ilarity between two matrices using the principal compo-nents (PCs), i.e., the eigenvectors of either the covarianceor the correlation coefficient matrices, and the eigenvaluesas weights. The weights are aggregated from the eigenval-ues of all the MTS items in the database. Hence, the weightschange whenever data are inserted into or removed from thedatabase.

Definition 1 Eros (Extended Frobenius norm). Let A and B be two MTS items of size m_A × n and m_B × n, respectively². Let V_A and V_B be two right eigenvector matrices obtained by applying SVD to the covariance matrices, M_A and M_B, respectively. Let V_A = [a_1, · · · , a_n] and V_B = [b_1, · · · , b_n], where a_i and b_i are column orthonormal vectors of size n. The Eros similarity of A and B is then defined as

\[ Eros(A, B, w) = \sum_{i=1}^{n} w_i\, |\langle a_i, b_i \rangle| = \sum_{i=1}^{n} w_i\, |\cos \theta_i| \qquad (1) \]

where ⟨a_i, b_i⟩ is the inner product of a_i and b_i, w is a weight vector which is based on the eigenvalues of the MTS data set, \( \sum_{i=1}^{n} w_i = 1 \), and θ_i is the angle between a_i and³ b_i. The range of Eros is between 0 and 1, with 1 being the most similar.

Intuitively, each w_i in the weight vector represents the aggregated variance for all the ith principal components. The weights are then normalized so that \( \sum_{i=1}^{n} w_i = 1 \). The eigenvalues obtained from all the MTS items in the database are aggregated into one weight vector as in Algorithms 1 or 2. Algorithm 1 computes the weight vector w based on the distribution of raw eigenvalues, while Algorithm 2 first normalizes each s_i, and then calls Algorithm 1. Function f() in Line 3 of Algorithm 1 is an aggregating function, e.g., min, mean and max.

Note that PCA, on which Eros is based, may be described as firstly representing each MTS item using either

²MTS items have the same number of columns (e.g., sensors), but may have different numbers of rows (e.g., time samples).

³For simplicity, it is assumed that the covariance matrices are of full rank. In general, the summations in Equation (1) should be from 1 to min(r_A, r_B), where r_A is the rank of M_A and r_B the rank of M_B.


Algorithm 1 Computing a weight vector w based on the distribution of raw eigenvalues

1: function computeWeightRaw(S)
Require: an n × N matrix S, where n is the number of variables for the dataset and N is the number of MTS items in the dataset. Each column vector s_i in S represents all the eigenvalues for the ith MTS item in the dataset. s_ij is a value at column i and row j in S. s_*i is the ith row in S. s_i* is the ith column, i.e., s_i.
2: for i = 1 to n do
3:   w_i ← f(s_*i);
4: end for
5: for i = 1 to n do
6:   w_i ← w_i / Σ_{j=1}^{n} w_j;
7: end for

Algorithm 2 Computing a weight vector w based on the distribution of normalized eigenvalues

1: function computeWeightRatio(S)
Require: the same as Algorithm 1.
2: for i = 1 to N do
3:   s_i ← s_i / Σ_{j=1}^{n} s_ij;
4: end for
5: computeWeightRaw(S);

covariance or correlation coefficients, and then performing SVD on the matrix that contains the coefficients. In order to stably represent an MTS using correlation coefficients, we proposed to utilize the stationarity of time series before computing the correlation coefficients of an MTS item [23]. Intuitively, if a time series is stationary, it means that the statistical properties of the time series, e.g., covariance and correlation coefficients, do not change over time. For details, please refer to [23].
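A compact sketch of Eros (Equation (1)) together with the raw-eigenvalue weight computation of Algorithm 1 (illustrative Python; the mean is used as the aggregating function f, and the random items are assumptions):

```python
import numpy as np

def eigen_decomposition(item):
    """Right eigenvectors and eigenvalues of an MTS item's covariance matrix."""
    U, lam, _ = np.linalg.svd(np.cov(item, rowvar=False))
    return U, lam

def compute_weight_raw(S, f=np.mean):
    """Algorithm 1: aggregate the ith eigenvalue over all items, then normalize."""
    w = f(S, axis=1)
    return w / w.sum()

def eros(A, B, w):
    """Equation (1): weighted sum of |cos| angles between corresponding eigenvectors."""
    VA, _ = eigen_decomposition(A)
    VB, _ = eigen_decomposition(B)
    return float(np.sum(w * np.abs(np.sum(VA * VB, axis=0))))

rng = np.random.default_rng(4)
items = [rng.standard_normal((60, 5)) for _ in range(10)]         # 10 MTS items, n = 5
S = np.column_stack([eigen_decomposition(x)[1] for x in items])   # n x N eigenvalue matrix
w = compute_weight_raw(S)
print(round(eros(items[0], items[1], w), 4))
```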

3 THE PROPOSED APPROACH

In this section, we will firstly describe the traditional PCA in a little more detail, and then briefly describe the kernel PCA technique in relation to the traditional PCA, which will be followed by our proposed approach.

Assume that we are given a set of N items, and each data item is an n dimensional column vector, i.e., x_i ∈ R^n, where 1 ≤ i ≤ N. Assume also that the data is mean centered, i.e., \( \sum_{i=1}^{N} x_{ji} = 0 \) for 1 ≤ j ≤ n. The covariance matrix can subsequently be computed as follows:

\[ C = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^T \]

The traditional PCA then diagonalizes the covariance matrix to obtain the principal components, which can be achieved by solving the following eigenvalue problem:

\[ \lambda v = C v \qquad (2) \]

Kernel PCA extends this traditional PCA approach, and performs PCA in the feature space. Hence, the data are first mapped into a high dimensional feature space using Φ : R^n → F, x → X. The covariance matrix in the feature space can be described as follows, assuming that the data are centered:

\[ C = \frac{1}{N} \sum_{i=1}^{N} \Phi(x_i)\, \Phi(x_i)^T \]

An N × N Kernel Matrix, which is also called a Gram matrix, can be defined as follows:

\[ K_{ij} = (\Phi(x_i) \cdot \Phi(x_j)) = k(x_i, x_j) \]

and, as for Equation (2), one computes an eigenvalue problem for the expansion coefficients α_i that is now solely dependent on the kernel function:

\[ \lambda \alpha = K \alpha \qquad (3) \]

Hence, intuitively, Kernel PCA can be performed by firstly obtaining the Kernel Matrix, and then solving the eigenvalue problem as in Equation (3). For details, please refer to [18, 15]. Let us formally define the kernel function and the kernel matrix [13].

Definition 2 A kernel is a function k, such that k(x, z) = ⟨Φ(x), Φ(z)⟩ for all x, z ∈ X, where Φ is a mapping from X to an (inner product) feature space F. A kernel matrix is a square matrix K ∈ R^{N×N} such that K_ij = k(x_i, x_j) for some x_1, · · · , x_N ∈ X and some kernel function k.

As in [13], the kernel matrices can be characterized as follows:

Proposition 1 Every positive semi-definite and symmetric matrix is a kernel matrix. Conversely, every kernel matrix is symmetric and positive semi-definite.

As a kernel function for the Kernel PCA technique, we propose to utilize Eros for MTS data sets. That is, given an MTS data set X and a weight vector w, the Kernel Matrix is constructed in such a way that K_Eros(i, j) = Eros(X_i, X_j, w). Note that Eros is not a distance metric, and cannot be readily represented in the form of a dot product as the other kernel functions. However, according to Proposition 1, K_Eros can be utilized for Kernel PCA, as long as K_Eros is symmetric and positive semi-definite, i.e., the eigenvalues of K_Eros are non-negative. Firstly, Eros is symmetric, i.e., Eros(X_i, X_j, w) = Eros(X_j, X_i, w). Hence, K_Eros is symmetric. Consequently, as long as K_Eros is positive semi-definite, K_Eros can be utilized for Kernel


PCA. In [16], a number of approaches to making a matrix into a PSD matrix have been described. In this paper, we utilize the first naive approach⁴, which is to add δI to K_Eros, i.e., K_Eros ← K_Eros + δI, when K_Eros is not PSD. For δ sufficiently larger in absolute value than the most negative eigenvalue of K_Eros, K_Eros is PSD.

Algorithms 3 and 4 describe how to compute K_Eros, and how to obtain the principal components in the feature space. Given an MTS data set and a weight vector w, we first construct the pair-wise similarity matrix, K_Eros, of size N × N, where N is the number of items in the given data set, as in Lines 2∼7 of Algorithm 3. Lines 8∼10 make sure K_Eros is PSD. The Kernel Matrix, K_Eros, is then mean-centered in the feature space in Line 3 of Algorithm 4. The eigenvalue problem in the feature space, i.e., Equation (3), is solved, and the principal components in feature space are obtained in Line 4.

Algorithm 3 Compute K_Eros

Require: MTS data set X with N the number of items in the data set and n the number of variables in an MTS item; w a weight vector for Eros
1: Construct a Kernel Matrix using Eros
2: for i = 1 to N do
3:   for j = i to N do
4:     K_Eros(i, j) ← Eros(X_i, X_j, w); X_i is the ith MTS item in X
5:     K_Eros(j, i) ← K_Eros(i, j);
6:   end for
7: end for
8: if K_Eros is not PSD then
9:   K_Eros ← K_Eros + δI; choose sufficiently large δ to make K_Eros PSD
10: end if

Algorithm 4 Perform PCA using K_Eros

Require: MTS data set X, N the number of items in the data set, n the number of variables in an MTS item, w a weight vector for Eros
1: K_Eros ← compute the Kernel Matrix using Algorithm 3;
2: Center the Kernel Matrix in feature space
3: K_Eros ← K_Eros − O × K_Eros − K_Eros × O + O × K_Eros × O; where O_ij = 1/N, 1 ≤ i, j ≤ N and N is the number of items
4: [V, v] ← solve the eigenvalue problem λα = K_Eros α; V contains the eigenvectors, and v the corresponding eigenvalues

⁴We plan to compare different approaches to transforming a non-PSD matrix into a PSD matrix for Eros in the future.
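A rough sketch of Algorithms 3 and 4 (illustrative Python; the δI shift and the centering follow the algorithms above, while the Eros function, random items and weights are the same assumptions as in the earlier sketch):

```python
import numpy as np

def eros(A, B, w):
    """Eros similarity, Equation (1) (as in the earlier sketch)."""
    VA = np.linalg.svd(np.cov(A, rowvar=False))[0]
    VB = np.linalg.svd(np.cov(B, rowvar=False))[0]
    return float(np.sum(w * np.abs(np.sum(VA * VB, axis=0))))

def kernel_matrix(items, w):
    """Algorithm 3: pair-wise Eros similarities, shifted by delta*I if not PSD."""
    N = len(items)
    K = np.empty((N, N))
    for i in range(N):
        for j in range(i, N):
            K[i, j] = K[j, i] = eros(items[i], items[j], w)
    min_eig = np.linalg.eigvalsh(K).min()
    if min_eig < 0:                        # make K_Eros positive semi-definite
        K = K + (1e-8 - min_eig) * np.eye(N)
    return K

def kernel_pca(K):
    """Algorithm 4: center the kernel matrix in feature space and diagonalize it."""
    N = len(K)
    O = np.full((N, N), 1.0 / N)
    Kc = K - O @ K - K @ O + O @ K @ O
    eigval, eigvec = np.linalg.eigh(Kc)
    order = np.argsort(eigval)[::-1]       # non-increasing eigenvalues
    return eigvec[:, order], eigval[order]

rng = np.random.default_rng(6)
items = [rng.standard_normal((30, 4)) for _ in range(8)]
w = np.full(4, 0.25)
V, v = kernel_pca(kernel_matrix(items, w))
print(V.shape, np.round(v[:3], 3))
```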

After obtaining the principal components in feature space using the training MTS items as in Algorithms 3 and 4, the projection of the test MTS items on the principal components is performed as in Algorithms 5 and 6. Intuitively, Lines 1∼4 of Algorithm 6 describe how to map the test data into feature space and subtract the pre-computed mean, i.e., mean-center the mapped data in the feature space. Line 5 projects the mean-centered data onto the principal components, V, in the feature space, which is analogous to the traditional PCA approach. For details, please refer to [18].

Algorithm 5 Compute K_Eros for Projection

Require: MTS data set X with N the number of items in the data set and n the number of variables in an MTS item; w a weight vector for Eros; test MTS data set X_test with N_test items and n variables
1: Construct a Kernel Matrix using Eros for Projection
2: for i = 1 to N_test do
3:   for j = 1 to N do
4:     K_Eros(i, j) ← Eros(X_test,i, X_j, w); X_j is the jth MTS item in X, and X_test,i the ith MTS item in X_test
5:   end for
6: end for

Algorithm 6 Project Test Data Set using K_Eros

Require: MTS data set X with N the number of items in the data set and n the number of variables in an MTS item; w a weight vector for Eros; test MTS data set X_test with N_test items and n variables; V obtained in Algorithm 4
1: K_Eros_test ← compute the Kernel Matrix using Algorithm 5;
2: K_Eros ← compute the Kernel Matrix using Algorithm 3;
3: Center the Kernel Matrix in feature space
4: K_Eros_test ← K_Eros_test − O_test × K_Eros − K_Eros_test × O + O_test × K_Eros × O; where O_ij = 1/N, 1 ≤ i, j ≤ N, and O_test,ij = 1/N, 1 ≤ i ≤ N_test, 1 ≤ j ≤ N
5: Y ← K_Eros_test × V; the ith MTS item is represented as features in the ith row of Y

4 PERFORMANCE EVALUATION

4.1 Datasets

The experiments have been conducted on two different real-world data sets, i.e., AUSLAN and BCI, which are all labeled MTS data sets whose labels are given. The Aus-


Table 2. Summary of data sets used in the experiments

                          AUSLAN   BCI
# of variables            22       64
(average) length          60       3000
# of labels               95       2
# of MTS items per label  27       189
total # of MTS items      2565     378

tralian Sign Language (AUSLAN) data set uses 22 sensors on the hands to gather the data sets generated by signing of a native AUSLAN speaker [11]. It contains 95 distinct signs, each of which has 27 examples. In total, the number of signs gathered is 2565. The average length is around 60.

The Brain Computer Interface (BCI) data set [12] was collected during the BCI experiment, where a subject had to perform imagined movements of either the left small finger or the tongue. The time series of the electrical brain activity was collected during these trials using 64 ECoG platinum electrodes. All recordings were gathered at 1000Hz. The total number of items is 378 and the length is 3000.

Table 2 shows the summary of the data sets used in the experiments.

4.2 Methods

For KEros, we first need to construct KEros. As de-scribed in Section 2.2, there are 6 different ways of obtain-ing weights for Eros. For the data sets used in the exper-iments, the mean aggregating function on the normalizedeigenvalues yields the overall best results, which are pre-sented in this section. In order to compute the classificationaccuracy of KEros, we performed 10 fold cross validation(CV) employing Support Vector Machine (SVM) [21]. Thatis, we break an MTS data set into 10 folds, use 9 folds toobtain the principal components in the feature space usingKEros and then project the data in the remaining 1 fold ontothe first 51 principal components to obtain 51 features. Wesubsequently computed the classification accuracy varyingthe number of features from 1 to 51. We repeated the 10fold cross validation ten times, and report the average clas-sification accuracy.

We compared the performance of KEros with two othertechniques, Kernel PCA using linear kernel (KLinear), andGeneralized Principal Component Analysis (GPCA) [24],in terms of classification accuracy. Since the linear kernel isthe simplest kernel for the Kernel PCA technique, we chosethe linear kernel as the performance baseline for the KernelPCA technique. Note that intuitively Kernel PCA using lin-ear kernel would perform similarly to vectorizing an MTSitem column-wise, i.e., concatenate columns back to back,

and performing PCA on it to extract the features.

GPCA does not require vectorization of the data, and works on each MTS item, i.e., a matrix, to reduce it to an (ℓ1, ℓ2) dimensional matrix. In [24], the best results have been reported when ℓ1 = ℓ2. Hence, we varied ℓ1 and ℓ2 from 2 to 7, and the sizes of the reduced matrix would be 4, 9, 16, 25, 36 and 49, respectively. In order to utilize SVM, these reduced matrices have been vectorized column-wise.

One of the disadvantages of GPCA and KLinear is thatthe number of observations within the MTS items should beall the same, while KEros, i.e., Eros, can be applied to theMTS items with variable number of observations. Hence,for GPCA and KLinear, the AUSLAN data set have beenlinearly interpolated, so that all the items have the samenumber of observations, which is the mean number of ob-servations, i.e., 60.

For KLinear, STPRtool implementation [6] and SVM-KM implementation [4] are utilized. For KEros, we mod-ified the Kernel PCA routine in STPRtool and SVM-KM.We implemented GPCA from scratch. All the implementa-tions are written in Matlab.

4.3 RESULTS

In order to check if the pair-wise similarity matrix com-puted by using Eros, i.e., KEros, is positive semi-definite,we obtained the eigenvalues of KEros. For the AUSLANdata set, the minimum eigenvalue of KEros is 3.2259e-06,and for the BCI data set, it is 0.0014. Hence, KEros forthe AUSLAN and BCI data sets turned out to be symmet-ric and positive semi-definite, i.e., all the eigenvalues arenon-negative. Consequently, we did not need to add δI toKEros; for the AUSLAN and BCI data sets, KEros is uti-lized as is as the Kernel Matrix for the Kernel PCA tech-nique.

Figure 2(a) shows the results of the classification accu-racy for the AUSLAN data set. Using only 14 features ob-tained by KEros, the classification accuracy is over 90%.As we increase the number of features for SVM, the perfor-mance of KLinear improves and when the number of fea-tures is more than 40, the performance difference betweenKLinear and KEros is almost negligible. The performanceof GPCA, however, is much worse than the Kernel PCAtechnique. Even when 49 features are employed, the classi-fication accuracy is less then 80%, while the others achievedmore than 90% of classification accuracy. There may bea couple of reasons for this poor performance of GPCA.Firstly, in [24], the data sets contain images which are rep-resented in approximately square matrices. For the AUS-LAN data set, however, each MTS item is not square; thenumber of observations is almost three times the numberof variables. Hence, the 1 and 2 parameters for GPCAshould be re-evaluated. Secondly, the result of dimension


[Figure: classification accuracy (%) versus number of features (1-51) for KEros, KLinear and GPCA; (a) AUSLAN Dataset, (b) BCI Dataset.]

Figure 2. Classification Accuracy Comparison

reduction using GPCA is still a matrix; a vectorization isrequired so that SVM can be utilized. Our vectorization bysimply concatenating the columns may have resulted in theloss of correlation information.

Figure 2(b) represents the classification accuracies of thethree techniques on the BCI data set. Similarly as for theAUSLAN data set, KEros outperforms other techniques interms of classification accuracy. When 16 features are used,KEros yielded more than 70% of classification accuracy.Unlike for the AUSLAN data set, KLinear does not per-form as well as KEros as the number of features increased.16 features from KLinear achieved just more than 60% ofclassification accuracy. The performance of GPCA is notgood for the BCI data set as well; the classification accuracyis more or less the chance level, i.e., 50%. As described forthe AUSLAN data set, the parameters for GPCA seem to re-quire re-configuration for the data sets whose items are notsquare matrices.

5 CONCLUSIONS AND FUTURE WORK

In this paper, we proposed a technique to utilize KernelPCA technique to extract features from MTS data sets usingEros as its similarity measure, termed KEros. Using Eros asa similarity measure between two MTS items, the correla-tion information between UTSs in one MTS item would notbe lost. In addition, utilizing the Kernel Trick, KEros doesscale well in terms of dimensionality of data sets. KErosfirst constructs the pair-wise similarity matrix using Eros,KEros. In order to be utilized as a Kernel Matrix for theKernel PCA technique, KEros is naıvely transformed, ifnecessary, in such a way that the transformed KEros is posi-tive semi-definite, i.e., all the eigenvalues of KEros are non-negative. Our experimental results show that using KEros

to extract features, the classification accuracy is up to 60%better than using features extracted using linear kernel, andGeneralized Principal Component Analysis (GPCA) [24].

We intend to extend this research in two directions.Firstly, more comprehensive experiments with more real-world data sets will be performed including comparisonswith other techniques such as Kernel LDA [15]. In [25],we utilized the principal component loadings to identify asubset of variables that are least redundant in terms of con-tributions to the principal components. We plan to exploresimilar feature subset selection techniques utilizing kernelmethods.

Acknowledgement

This research has been funded in part by NSF grantsEEC-9529152 (IMSC ERC), IIS-0238560 (PECASE) andIIS-0307908, and unrestricted cash gifts from Microsoft.Any opinions, findings, and conclusions or recommenda-tions expressed in this material are those of the author(s)and do not necessarily reflect the views of the National Sci-ence Foundation. The authors would also like to thank theanonymous reviewers for their valuable comments.

References

[1] M. Aizerman, E. Braverman, and L. Rozonoer. Theoret-ical foundations of the potential function method in pat-tern recognition learning. Automation and Remote Control,25:821–837, 1964.

[2] J. Alon, S. Sclaroff, G. Kollios, and V. Pavlovic. Discover-ing clusters in motion time-series data. In IEEE ComputerVision and Pattern Recognition, pages 18–20, June 2003.

[3] T. Bozkaya and M. Ozsoyoglu. Indexing large metric spacesfor similarity search queries. ACM TODS, 24(3), 1999.


[4] S. Canu, Y. Grandvalet, and A. Rakotomamonjy. Svm andkernel methods matlab toolbox. Perception Systmes et In-formation, INSA de Rouen, Rouen, France, 2003.

[5] A. Corradini. Dynamic time warping for off-line recognitionof a small gesture vocabulary. In IEEE RATFG, pages 82–89, July 2001.

[6] V. Franc and V. Hlavac. Statistical pattern recognition tool-box for matlab. http://cmp.felk.cvut.cz/∼xfrancv/stprtool/,June 2004.

[7] C. Goutte, P. Toft, E. Rostrup, F. A. Nielsen, and L. K.Hansen. On clustering fMRI time series. NeuroImage,9(3):298–310, 1999.

[8] D. J. Hand, P. Smyth, and H. Mannila. Principles of datamining. MIT Press, Cambridge, MA, USA, 2001.

[9] S. Hettich and S. D. Bay. The UCI KDD Archive.http://kdd.ics.uci.edu, 1999.

[10] I. T. Jolliffe. Principal Component Analysis. Springer, 2002.[11] M. W. Kadous. Temporal Classification: Extending the

Classification Paradigm to Multivariate Time Series. PhDthesis, University of New South Wales, 2002.

[12] T. N. Lal, T. Hinterberger, G. Widman, M. Schroder, N. J.Hill, W. Rosenstiel, C. E. Elger, B. Scholkopf, and N. Bir-baumer. Methods towards invasive human brain computerinterfaces. In Advances in Neural Information ProcessingSystems 17, pages 737–744. Cambridge, MA, 2005.

[13] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui,and M. I. Jordan. Learning the kernel matrix with semidefi-nite programming. J. Mach. Learn. Res., 5:27–72, 2004.

[14] Q. Li, J. Ye, and C. Kambhamettu. Linear projection meth-ods in face recognition under unconstrained illuminations: acomparative study. In Computer Vision and Pattern Recog-nition, 2004. CVPR 2004. Proceedings of the 2004 IEEEComputer Society Conference on, volume 2, pages 474–481,July 2004.

[15] K.-R. Muller, S. Mika, G. Ratsch, K. Tsuda, andB. Scholkopf. An introduction to kernel-based learningalgorithms. IEEE Trans. Pattern Anal. Machine Intell.,12(2):181–201, March 2001.

[16] X. Nguyen, M. I. Jordan, and B. Sinopoli. A kernel-basedlearning approach to ad hoc sensor network localization.ACM Trans. Sen. Netw., 1(1):134–152, 2005.

[17] C. Rao, A. Gritai, M. Shah, and T. Syeda-Mahmood. View-invariant alignment and matching of video sequences. InIEEE ICCV, pages 939–945, October 2003.

[18] B. Scholkopf, A. J. Smola, and K.-R. Muller. Nonlinearcomponent analysis as a kernel eigenvalue problem. NeuralComputation, 10(5):1299–1319, 1998.

[19] C. Shahabi. AIMS: An immersidata management system. InVLDB CIDR, January 2003.

[20] A. Tucker, S. Swift, and X. Liu. Variable grouping in multi-variate time series via correlation. IEEE Trans. Syst., Man,Cybern. B, 31(2):235–245, 2001.

[21] V. Vapnik. Statistical Learning Theory. New York: Wiley,1998.

[22] K. Yang and C. Shahabi. A PCA-based similarity measurefor multivariate time series. In MMDB ’04: Proceedings ofthe 2nd ACM international workshop on Multimedia data-bases, pages 65–74, Washington, DC, USA, 2004. ACMPress.

[23] K. Yang and C. Shahabi. On the stationarity of multivariatetime series for correlation-based data analysis. In The FifthIEEE International Conference on Data Mining, Houston,TX, USA, November 2005.

[24] J. Ye, R. Janardan, and Q. Li. Gpca: an efficient dimensionreduction scheme for image compression and retrieval. InKDD ’04: Proceedings of the tenth ACM SIGKDD interna-tional conference on Knowledge discovery and data mining,pages 354–363, New York, NY, USA, 2004. ACM Press.

[25] H. Yoon, K. Yang, and C. Shahabi. Feature subset selectionand feature ranking for multivariate time series. IEEE Trans.Knowledge Data Eng. - Special Issue on Intelligent DataPreparation, 17(9), September 2005.

[26] X. L. Zhang, H. Begleiter, B. Porjesz, W. Wang, andA. Litke. Event related potentials during object recognitiontasks. Brain Research Bulletin, 38(6):531–538, 1995.


Fast similarity search of time series data using the Nystrom method

Akira Hayashi
Faculty of Information Sciences, Hiroshima City University
3-4-1 Ozuka-higashi, Asaminami-ku, Hiroshima 731-3194, Japan, [email protected]

Katsutoshi Nishizaki
NEC Fielding, Ltd.,
1-4-28, Mita, Minato-Ku, Tokyo, 108-0073, [email protected]

Nobuo Suematsu
Faculty of Information Sciences, Hiroshima City University
3-4-1 Ozuka-higashi, Asaminami-ku, Hiroshima 731-3194, Japan, [email protected]

Abstract

We consider how to speed up similarity search of vector valued time series data, when the dissimilarity is defined in terms of DTW distances. We take an approach of embedding the time series data in a low dimensional Euclidean space and performing a multidimensional search. In order to speed up the embedding process, we propose to use the Nystrom method, a method originally developed for the numerical solution of integral equations. Thanks to the Nystrom method, DTW distances to only a small number of samples are sufficient to embed the data with high accuracy. Let the number of time series in the DB and the number of samples be n and m (m << n), respectively. The time complexity of the proposed method for each query is O(ml²), while the complexity of linear search is O(nl²), where l is the average length of the time series.

1. Introduction

1.1. Background

Mining from a large collection of time series data has been gaining more and more attention recently, and much research has been done on similarity search, classification, clustering, and segmentation of time series data [11]. Among these, similarity search is important not only for its direct use but also for its use as preprocessing for classification and clustering.

As a dissimilarity measure in similarity search, the distance obtained from dynamic time warping (DTW) [14] is frequently used [19, 10]. DTW evaluates all possible matches between two time series, allowing time warping, and finds the match with the smallest distance, i.e. the largest similarity (see Sec. A). Compared with the Euclidean distance, the DTW distance is more robust to speed changes, and is considered to reflect the human perception of similarity.

1.2. Problem Definition

We consider how to speed up the similarity search of vector valued time series data, where the dissimilarity is defined in terms of the DTW distance. More concretely, we consider the following problem.

A set of n time series (the time series DB), X = {X_1, ..., X_n}, is given, where X_i (1 ≤ i ≤ n) is a sequence of feature vectors of length l_i: X_i = (x^i_1, ..., x^i_{l_i}). Given a query time series Q = (q_1, ..., q_l), quickly find the k nearest neighbors of Q, i.e. the k X_i's with the smallest DTW distances to Q.

1.3. Proposed Method

We take an approach of embedding the time series data in a low dimensional Euclidean space and performing a multidimensional search. Multidimensional search using k-d trees or R-trees [15] is known to be effective for similarity search of vector data. Unfortunately, similarity search of time series data using DTW distances is difficult because of the following problems.

• Computing a DTW distance takes O(l²) time, where l is the length of the time series.

• DTW distances do not satisfy the triangle inequality: D(X, Z) ≤ D(X, Y) + D(Y, Z).

In order to speed up the embedding process, the key issue is how to embed the data accurately from a small number of DTW distances. We propose to use the Nystrom method, a method originally developed for the numerical solution of integral equations [1, 13]. Thanks to the Nystrom method, DTW distances to only a small number of samples are sufficient to embed the data with high accuracy. Let the number of time series in the DB and the number of samples be n and m, respectively. The time complexity of the proposed method for each query is O(ml²), while the complexity of linear search is O(nl²), where l is the average length of the time series.

We consider multidimensional scaling (MDS) [16] and the Laplacian Eigenmap (LE) [2] as candidate embedding methods. MDS is well known as an embedding method that preserves distances, but it has the drawback that its kernel matrix is not positive semi-definite for DTW distances. LE preserves neighborhood relationships and its Laplacian matrix is always positive semi-definite, but it transforms the distances nonlinearly. We apply the Nystrom method to MDS and LE, and evaluate their performance in an experiment using a large scale DB of time series data.

1.4. Related Work

FastMap [4] is a method to speed up MDS by using heuristics. In FastMap, pairs of data points called pivots are first selected which approximate the eigenvectors of the kernel matrix obtainable from the distance matrix. Coordinates in the embedded space spanned by the pivots are then computed by projecting the data onto the axis between each pair of pivots. Yi et al. [19] embed time series data in a Euclidean space based on DTW distances using FastMap, and perform a multidimensional search. By using FastMap, they cut down the number of DTW distance calculations. However, since the number of pivots is determined by the dimension of the embedded space, high embedding accuracy is difficult to obtain for low dimensional embeddings¹.

Since multidimensional search has the above problem, linear search has been considered inevitable. Therefore, [10] computes a lower bound of the DTW distance quickly, in order to filter out non-similar time series. But this technique is limited to scalar valued time series data, and is difficult to extend to vector valued time series data.

¹ For an efficient multidimensional search, the dimension should be no more than 10 to 20.

The Nystrom method was originally developed for the numerical solution of Fredholm integral equations of the second kind [13]. It has recently been gaining attention in the pattern recognition and machine learning communities. Williams and Seeger [18] use the method to speed up kernel PCA. Fowlkes et al. [5] use it to speed up spectral clustering for image segmentation. Bengio et al. [3] link spectral embedding and kernel PCA from the viewpoint of learning eigenfunctions of integral equations.

1.5. Paper Organization

We explain the Euclidean space embedding methods and the Nystrom extension in Sec. 2 and Sec. 3, respectively. Our proposed method is given in Sec. 4. We report on the experiment in Sec. 5, and conclude in Sec. 6.

2. Embedding Methods

2.1. MDS

Let {d(X_i, X_j) | 1 ≤ i, j ≤ n} be the DTW distances between the time series in X. MDS [16] is a method to obtain embedding coordinates z_i for each X_i via a mapping Φ : X → R^n such that the following holds.

‖Φ(X_i) − Φ(X_j)‖² = d²(X_i, X_j)   (1 ≤ i, j ≤ n)   (1)

Let the centralized inner product (kernel) matrix k(X_i, X_j) be such that k(X_i, X_j) = 〈z_i − z̄, z_j − z̄〉 with z̄ = (1/n) Σ_i z_i. Then, from the relationship between inner products and distances in Euclidean space, the following holds [7].

k(X_i, X_j) = −(1/2) d²(X_i, X_j) + (1/(2n)) Σ_{l=1}^{n} d²(X_i, X_l) + (1/(2n)) Σ_{l=1}^{n} d²(X_j, X_l) − (1/(2n²)) Σ_{l=1}^{n} Σ_{m=1}^{n} d²(X_l, X_m)   (2)

Then, the kernel matrix K with K_ij = k(X_i, X_j) is decomposed through eigenvalue analysis as follows.

K = U Λ Uᵀ   (3)

where Λ = diag(λ_1, ..., λ_n), λ_1 ≥ ... ≥ λ_n, is the diagonal matrix of eigenvalues, and U = [e_1, ..., e_n] is the matrix of eigenvectors.

When K is positive semi-definite, i.e. when λ_n ≥ 0, let Z = Λ^{1/2} Uᵀ; then K = Zᵀ Z holds. Hence, we can view the i-th column of Z as z_i − z̄. Translate the origin to the centroid, and denote by z_i (1 ≤ i ≤ n) the new coordinates. Consider the projection from R^n to R^p which best approximates the kernel matrix in terms of the Frobenius norm. The image ẑ_i of z_i under this projection is expressed as follows, using the p largest eigenvalues / eigenvectors of K.

ẑ_i = (√λ_1 e_1(i), √λ_2 e_2(i), ..., √λ_p e_p(i))ᵀ   (1 ≤ i ≤ n)   (4)

where e_k(i) is the i-th element of the eigenvector e_k. Unfortunately, K, the matrix defined in (2), is not positive semi-definite, because DTW distances do not satisfy the triangle inequality. Nevertheless, we embed the data in the Euclidean space using (4), simply by neglecting the negative eigenvalues / eigenvectors.
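As a concrete illustration, the following is a minimal Python sketch of this classical-MDS embedding, assuming a precomputed n x n matrix D2 of squared DTW distances; the function and variable names, and the use of NumPy, are our own and not part of the paper.

import numpy as np

def mds_embed(D2, p):
    """Embed items with pairwise squared distances D2 (n x n) into R^p.

    Follows (2)-(4): double-center the squared distances to obtain the
    kernel K, eigendecompose it, and keep the p largest eigenvalues,
    clipping negative ones (DTW distances violate the triangle inequality).
    """
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    K = -0.5 * J @ D2 @ J                      # centered kernel, eq. (2) in matrix form
    lam, E = np.linalg.eigh(K)                 # eigenvalues in ascending order
    idx = np.argsort(lam)[::-1][:p]            # p largest eigenvalues
    lam_p, E_p = np.clip(lam[idx], 0.0, None), E[:, idx]
    Z = E_p * np.sqrt(lam_p)                   # row i is z_i of eq. (4)
    return Z, lam_p, E_p

The double-centering of D2 is exactly the right-hand side of (2) written in matrix form, so the sketch mirrors the derivation above rather than introducing a different construction.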

2.2. Laplacian Eigenmap

Laplacian Eigenmap (LE) is an embedding method which preserves neighborhood relationships [2]. LE has the advantage that its Laplacian matrix, defined below, is always positive semi-definite, even if the distances do not satisfy the triangle inequality [7]. An algorithm to embed X, a set of time series, in R^p using LE follows.

1. Compute the similarity matrix W from the DTW distances:

W_ij = exp(−d²(X_i, X_j)/t)  if i ≠ j and X_j ∈ N_ε(X_i);  0 otherwise   (5)

where N_ε(X_i) stands for an ε-neighborhood of X_i, and t (> 0) is a hyperparameter.

2. Compute the Laplacian matrix L = D − W, where D is the diagonal matrix with D_ii = Σ_{j=1}^{n} W_ij.

3. Solve the generalized eigenvalue problem

L e = λ D e   (6)

for the p smallest eigenvectors e_1, ..., e_p (λ_1 ≤ λ_2 ≤ ... ≤ λ_p).

4. Compute the coordinates in the embedded space. Let U = [e_1, e_2, ..., e_p]; then Z = [z_1, ..., z_n] = Uᵀ, i.e.

z_i = (e_1(i), e_2(i), ..., e_p(i))ᵀ   (1 ≤ i ≤ n)   (7)
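A similar sketch of steps 1-4, again assuming a precomputed squared-DTW-distance matrix; the ε-neighborhood test, the hyperparameter t, and the SciPy generalized eigensolver are our own illustrative choices.

import numpy as np
from scipy.linalg import eigh

def laplacian_eigenmap(D2, p, t, eps):
    """Steps 1-4 above: similarity matrix (5), Laplacian, generalized
    eigenvalue problem (6), and embedding coordinates (7).

    Assumes every point has at least one neighbor within eps, so that
    D is positive definite and eigh(L, D) is well defined.
    """
    n = D2.shape[0]
    neighbors = (D2 <= eps ** 2) & ~np.eye(n, dtype=bool)
    W = np.where(neighbors, np.exp(-D2 / t), 0.0)   # eq. (5)
    D = np.diag(W.sum(axis=1))
    L = D - W
    lam, E = eigh(L, D)        # generalized eigenproblem, ascending eigenvalues
    return E[:, :p]            # row i is z_i, eq. (7)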

2.3. Difficulties

Unfortunately, we cannot use MDS or LE in their original forms, for the following reasons. For each query Q, we would need to perform the eigendecomposition again in order to embed Q, which takes O(n³) time. Moreover, in order to embed Q, we would first need to compute the DTW distances to all data in the DB, d²(Q, X_i) (1 ≤ i ≤ n).

3. The Nystrom Method

3.1. Nystrom Extension

The Nystrom method was originally developed for the numerical solution of Fredholm integral equations of the second kind [1, 13]. We consider here the following homogeneous integral equation.

∫ K(x, y) f_k(y) p(y) dy = λ′_k f_k(x)   (8)

Given a set of samples D = {x_i | 1 ≤ i ≤ m}, we approximate the density p(x) by the sample distribution (1/m) Σ_{i=1}^{m} δ(x − x_i), and obtain the following.

λ′_k f_k(x) ≈ (1/m) Σ_{i=1}^{m} K(x, x_i) f_k(x_i)   (9)

By substituting x = x_i, i = 1, 2, ..., m, into (9) and solving the resulting m-dimensional eigenvalue problem (10), we can obtain the values of the eigenfunction f_k(x) at the sampling points x_i ∈ D.

λ′_k f_k(x_i) = (1/m) Σ_{j=1}^{m} K(x_i, x_j) f_k(x_j)   (1 ≤ i ≤ m)   (10)

Note that equations (10) and (3) (when only the samples are embedded) are the same except for the constant 1/m. That is to say,

f_k(x_i) = e_k(i)   (1 ≤ i ≤ m),   λ′_k = (1/m) λ_k   (11)

Given x ∉ D, the Nystrom extension approximately computes f_k(x), the value of the eigenfunction, without solving the eigenvalue problem again. Assuming λ′_k ≠ 0 in (9), we obtain:

f_k(x) = (1/(m λ′_k)) Σ_{i=1}^{m} K(x, x_i) f_k(x_i)   (12)

The Nystrom extension can be seen as an interpolation / extrapolation method. There are many such methods, but the Nystrom extension uses (12) for interpolation / extrapolation, which is essentially the same equation as (9) used for obtaining the function values at the sampling points. This contributes to its approximation accuracy [13].

3.2. Applying to MDS

Denote the set of samples of time series as {X_1, ..., X_m}. We apply the Nystrom extension and thus obtain, without solving the eigenvalue problem a second time, the embedding coordinates z of a time series X which is not among the samples.

The Nystrom extension is originally about kernel functions defined analytically (see (8)). But we can apply the extension to MDS, a kind of data-dependent kernel. For this purpose, we first extend the domain of the kernel function from sample pairs (see (2)) to any X, X′ as follows.

k(X, X′) = −(1/2) d²(X, X′) + (1/(2m)) Σ_{l=1}^{m} d²(X, X_l) + (1/(2m)) Σ_{l=1}^{m} d²(X′, X_l) − (1/(2m²)) Σ_{l=1}^{m} Σ_{l′=1}^{m} d²(X_l, X_{l′})   (13)

In summary, after solving the eigenvalue problem in (10) and obtaining f_k(X_i) (1 ≤ i ≤ m), k = 1, ..., p, the values of the eigenfunction at the samples, we apply the Nystrom extension to MDS as follows.

1. A non-sample time series X is given.

2. Compute k(X, X_i) (1 ≤ i ≤ m) from (13).

3. Obtain f_k(X), k = 1, ..., p, from (12).

4. Let the embedding coordinates of X be z = (√λ′_1 f_1(X), ..., √λ′_p f_p(X))ᵀ.
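A sketch of these out-of-sample steps, reusing the eigenvalues lam_p and eigenvectors E_p of the sample kernel matrix from the MDS sketch in Sec. 2.1; d2_to_samples holds d²(X, X_l) for the new series and D2s is the m x m matrix of squared DTW distances among the samples (all names are hypothetical).

import numpy as np

def nystrom_mds_embed(d2_to_samples, D2s, lam_p, E_p):
    """Embed a non-sample series X from its squared DTW distances to the
    m samples, via the extended kernel (13) and the Nystrom extension (12).

    lam_p, E_p come from the sample eigenproblem, so f_k(X_i) = E_p[i, k]
    and lambda'_k = lam_p[k] / m (eq. (11)). Assumes the p retained
    eigenvalues are strictly positive (choose p accordingly).
    """
    m = D2s.shape[0]
    row_mean = D2s.mean(axis=1)                 # (1/m) sum_l d2(X_i, X_l)
    total_mean = D2s.mean()                     # (1/m^2) sum_{l,l'} d2(X_l, X_l')
    kx = (-0.5 * d2_to_samples                  # eq. (13) for each sample X_i
          + 0.5 * d2_to_samples.mean()
          + 0.5 * row_mean
          - 0.5 * total_mean)
    lam_prime = lam_p / m                       # eq. (11)
    f = (kx @ E_p) / (m * lam_prime)            # eq. (12): f_k(X), k = 1..p
    # The paper writes z = (sqrt(lambda'_1) f_1(X), ...); here we scale by
    # sqrt(lambda_k) = sqrt(m * lambda'_k) instead so that the coordinates
    # match those returned by mds_embed() in the earlier sketch. The global
    # sqrt(m) factor does not change nearest-neighbor rankings.
    return np.sqrt(lam_p) * f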

3.3. Applying to LE

Applying the Nystrom extension to LE is more complicated. This is because while MDS solves an eigenvalue problem, LE solves a generalized eigenvalue problem, which does not directly correspond to (10). Here we derive the application on the basis of applying the Nystrom extension to spectral clustering (SC), as developed by Bengio et al. [3].

SC solves an eigenvalue problem for the kernel matrix K^SC defined as follows.

K^SC = D^{−1/2} W D^{−1/2},  i.e.  K^SC_ij = W_ij / √(D_ii D_jj)   (14)

where W and D are the same as those defined for LE. The application of the Nystrom extension is based on the following kernel function.

k(X, X′) = W(X, X′) / √(D(X, X) D(X′, X′))   (15)

where D(X, X) = (1/m) Σ_{l=1}^{m} W(X, X_l) and D(X′, X′) = (1/m) Σ_{l=1}^{m} W(X′, X_l).

Let λ^SC_k and e^SC_k be the eigenvalues / eigenvectors of K^SC in (14), and λ^LE_k and e^LE_k be the generalized eigenvalues / eigenvectors for LE in (6). Then we can easily show that the following holds.

e^LE_k = D^{−1/2} e^SC_k,   λ^LE_k = 1 − λ^SC_k   (16)

In summary, after solving the eigenvalue problem for K^SC in (14) and obtaining f_k(X_i) (1 ≤ i ≤ m), k = 1, ..., p, the values of the eigenfunction at the samples, we apply the Nystrom extension to LE as follows.

1. A non-sample time series X is given.

2. Compute k(X, X_i) (1 ≤ i ≤ m) from (15).

3. Obtain f_k(X), k = 1, ..., p, from (12).

4. Let e^SC_k = (f_k(X_1), ..., f_k(X_m), f_k(X))ᵀ. Compute e^LE_k, k = 1, ..., p, the solution of the generalized eigenvalue problem for LE, using (16).

5. Let the embedding coordinates of X be z = (e^LE_1(m + 1), ..., e^LE_p(m + 1))ᵀ.

4. Proposed Method

The proposed method consists of two phases, the preprocessing phase and the query phase. In the preprocessing phase, we embed the time series in the DB and construct a multidimensional search tree such as a k-d tree or an R-tree [15]. In the query phase, we embed the query time series and, using its coordinates as a key, perform a multidimensional search in the tree.

4.1. Preprocessing Phase

1. A set of time series (the time series DB) X = {X_1, ..., X_n} is given.

2. Select m (m << n) samples from X, and reassign the subscripts so that the set of samples is {X_1, ..., X_m}.

3. Compute the DTW distances between the samples: d²(X_i, X_j) (1 ≤ i, j ≤ m).

4. Embed all the samples X_i in R^p, i.e. obtain {z_i ∈ R^p | 1 ≤ i ≤ m}. For MDS, use (4) or (10). For LE, use (16) after solving the eigenvalue problem in (14).

5. Use the Nystrom method of Sec. 3.2 or 3.3 to embed X_i (m + 1 ≤ i ≤ n), and obtain {z_i ∈ R^p | m + 1 ≤ i ≤ n}.

6. Construct a multidimensional search tree [15] for {z_i ∈ R^p | 1 ≤ i ≤ n}.


4.2. Query Phase

1. A query time series Q is given.

2. Compute the DTW distances between Q and the m samples: d²(Q, X_i) (1 ≤ i ≤ m).

3. Use the Nystrom method of Sec. 3.2 or 3.3 to embed the query Q, and obtain z_Q ∈ R^p.

4. Using z_Q as the search key, perform a multidimensional nearest neighbor search in {z_i | 1 ≤ i ≤ n}, and obtain the neighborhood Nbr(Q) = {X_i | z_i ∈ Nbr(z_Q), 1 ≤ i ≤ n}.

5. Compute the DTW distances between Q and the members of its neighborhood in R^p, {d²(Q, X_i) | X_i ∈ Nbr(Q)}, and return the truly nearest neighbors.
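A minimal sketch of these query-phase steps, assuming the preprocessing phase has produced an array of embedded DB coordinates stored in a SciPy k-d tree, a list db of the DB time series, a dtw2() routine as in Appendix A, and the Nystrom embedding helper sketched in Sec. 3.2; the candidate-set size is a tunable parameter we introduce only for illustration.

import numpy as np
from scipy.spatial import cKDTree

def query(Q, samples, db, tree, D2s, lam_p, E_p, k=20, n_candidates=60):
    """Steps 1-5: embed Q via the Nystrom method, search the tree for
    candidate neighbors, then re-rank the candidates by true DTW distance."""
    # Step 2: DTW distances from Q to the m samples only.
    d2_to_samples = np.array([dtw2(Q, X) for X in samples])
    # Step 3: embed the query.
    zq = nystrom_mds_embed(d2_to_samples, D2s, lam_p, E_p)
    # Step 4: approximate neighborhood in the embedded space.
    _, idx = tree.query(zq, k=n_candidates)
    # Step 5: exact DTW distances only for the candidates, then re-rank.
    cand = sorted((dtw2(Q, db[i]), i) for i in np.atleast_1d(idx))
    return [i for _, i in cand[:k]]

Step 5 re-ranks only the candidates by their exact DTW distances, so the number of full DTW computations per query stays at m plus the (small) candidate-set size.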

4.3. Time Complexity

We consider the time complexity of the query phase, on which the response time of the system depends.

Since the complexity of computing each DTW distance is O(l²), where l is the (average) length of the time series, the complexity of Step (2) is O(ml²). The complexity of Step (3) is O(mp)³. Let us denote the time for the multidimensional search in Step (4) by T_s. The complexity of Step (5) is O(|Nbr(Q)| · l²). The total is therefore O(ml²) + O(mp) + T_s + O(|Nbr(Q)| · l²).

We compare the time complexity of our method with that of a linear search. Under the reasonable assumptions that p << l² and that T_s and |Nbr(Q)| depend on neither n nor m, the complexity of our method is O(ml²), while the complexity of a linear search is O(nl²). If m grows more slowly than n, our method has the lower order of time complexity.

5. Experiment

The objectives of the experiment are to compare the candidate embedding methods (MDS and LE)⁴ and to evaluate how the sample size m affects the embedding accuracy. We use three kinds of data: vector data, synthetic time series data, and real world time series data.

In the experiment, the DTW distances are computed using the procedure in Sec. A. Samples are selected using a modified K-means method, which is K-means using distances only (see Sec. B). K-means sampling showed slightly better performance than random sampling.

³ We can compute {k(Q, X_i) | 1 ≤ i ≤ m} in (13) for MDS in O(m) time by computing the terms which do not depend on Q in advance, and by computing common terms just once.

⁴ FastMap could not embed the time series data in more than 3 dimensions, because the law of cosines, which FastMap uses to compute coordinates, does not hold for DTW distances.

5.1. Task

We choose, as the task, searching for the 20 nearest neighbors (NNs) in the time series DB. We compute recall-precision (RP) curves for each embedding method with different numbers of samples. Recall and precision are frequently used performance indexes for information retrieval systems.

We view the top k (k > 20) NNs in the embedded space as retrieved results (positives), and count how many of them are true, i.e. are within the 20 NNs in terms of DTW distance. Let the number of true positives be l. Then, recall (R) and precision (P) are computed as follows.

R = l/20,   P = l/k   (17)
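For example (an illustrative calculation of (17), not a result from the paper): if the top k = 40 neighbors in the embedded space contain l = 16 of the 20 true DTW nearest neighbors, then R = 16/20 = 80% and P = 16/40 = 40%.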

We conjecture that the retrieval performance depends not only on the number of samples, but also on which data are selected as samples. How to choose appropriate samples is an important problem in its own right. Here, however, we simply choose samples using the modified K-means explained in Sec. B.

5.2. MNIST

MNIST [12] is a collection of images of handwritten digits with 28 by 28 pixels, which are represented by 784 dimensional vectors. Here, we use the Euclidean distance as the dissimilarity measure. The purpose of this experiment is twofold: one is to determine the upper bound of the performance when the kernel matrix defined in (2) has no negative eigenvalues, and the other is to compare the performance with that of FastMap.

Fig. 1 shows the RP curves for this dataset. The DB size is n = 5923, and the dimension of the embedded space is p = 10. The average over 980 queries was taken. We can see from Fig. 1 that the embeddings using MDS are better than those using LE; LE is still better than FastMap.

Let us describe the embedding accuracy quantitatively. For MDS, when we cut the sample size m down to 50% of the total data size n, the corresponding RP curve remains almost in the same place, which means that there is very little decrease in retrieval performance. When m is only 1% of n, the performance degrades, but is still better than that of LE without sampling. In order to find 18 nearest neighbors out of 20 (R = 90%) with 100% samples, we have to search up to 33 neighbors in the embedded space (P ≈ 55%), whereas with 1% samples we have to search up to 60 neighbors (P ≈ 30%).

[Figure 1. MNIST: RP curves for MDS (m = 5923, 2961, 60), LE (m = 5923, 2961, 297), and FastMap. The DB size is n = 5923 and the dimension of the embedded space is p = 10. The average over 980 queries was taken.]

5.3. CBF

Cylinder-Bell-Funnel (CBF) [10] is a data set of synthetic time series, frequently used in time series data mining. CBF consists of scalar valued time series synthesized using random numbers. There are 3 classes in CBF, and we use one of them, Cylinder, to synthesize time series {c(t) | 1 ≤ t ≤ 32} from (18).

c(t) = (8 + z) R[a, b](t) + e(t)   (18)

R[a, b](t) = 1 if a ≤ t ≤ b;  0 if t < a or t > b

where z and e(t) are sampled from N(0, 1), a is an integer sampled uniformly from [4, 8], and (b − a) is an integer sampled uniformly from [8, 24].
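For illustration, Cylinder series following (18) could be generated as below; the generator and variable names are our own.

import numpy as np

def cylinder_series(length=32, rng=None):
    """Generate one Cylinder time series c(t), 1 <= t <= length, per (18)."""
    rng = np.random.default_rng() if rng is None else rng
    z = rng.standard_normal()                  # z ~ N(0, 1)
    e = rng.standard_normal(length)            # e(t) ~ N(0, 1)
    a = rng.integers(4, 9)                     # a uniform on [4, 8]
    b = a + rng.integers(8, 25)                # b - a uniform on [8, 24]
    t = np.arange(1, length + 1)
    plateau = ((t >= a) & (t <= b)).astype(float)   # R[a, b](t)
    return (8.0 + z) * plateau + e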

Fig. 2 shows the RP curves for the CBF data. The DB size is n = 3000, and the dimension of the embedded space is p = 19. The average over 150 queries was taken.

Fig. 3 shows the eigenvalue distribution. To draw the figure, we first sorted the eigenvalues in descending order and then took absolute values. Almost half of the eigenvalues are negative. The maximum eigenvalue is 8.275, the minimum is −5.095.

5.4. ASL

The ASL data [9] in the UCI KDD Archive [8] consists of 95 sign words obtained from 5 subjects, each instance of which is a sequence of 9 dimensional feature vectors⁵. Each sign word has about 70 instances.

⁵ Subjects wear a Nintendo Power Glove on their right hand; the 3D coordinates and roll (whether it is pointing up or down) of the palm and the bend of the five fingers are measured.

[Figure 2. CBF: RP curves for MDS (m = 3000, 1500, 80) and LE (m = 3000, 1500, 450). n = 3000, p = 19. Average over 150 queries.]

[Figure 3. CBF: Distribution of the eigenvalues of the kernel matrix in (2) for the DB time series.]


We use as the DB the time series instances for 43 words, such as "change", "deaf", "glad", "her", and "innocent", which have similar words. We use instances of "lose" and "love" as query time series.

Fig. 4 shows the RP curves for the ASL data. The DB size is n = 3000, and the dimension of the embedded space is p = 19. The average over 150 queries was taken.

[Figure 4. ASL: RP curves for MDS (m = 3000, 1500, 600) and LE (m = 3000, 1500, 300). n = 3000, p = 19. Average over 150 queries.]

Fig. 5 shows the eigenvalue distribution. Almost half of the eigenvalues are negative. The maximum eigenvalue is 8.273, and the minimum eigenvalue is −1.72043.

5.5. Discussion

We can explain why MDS performs best for MNIST, whose kernel matrix has no negative eigenvalues: the embedding by MDS and the subsequent dimensionality reduction are essentially the same as principal component analysis (PCA) [17]. PCA is known to be the optimal dimensionality reduction technique in that the approximation error is kept minimal. Although LE maps neighborhoods to neighborhoods in a qualitative sense, it transforms the distances nonlinearly (see (5)), and hence does not preserve DTW distances quantitatively.

Why does MDS perform better than LE for ASL in Fig. 4? We conjecture the reason as follows. While we neglected many negative eigenvalues / eigenvectors in MDS (see Fig. 5 for the eigenvalue distribution), we nonlinearly transformed the DTW distances in LE. We do not know exactly how much each of these influenced the embedding accuracy, but the result seems favorable to MDS for the ASL dataset.

[Figure 5. ASL: Distribution of the eigenvalues of the kernel matrix in (2) for the DB time series.]

The performance drop due to the Nystrom approximation was far smaller than we had expected. MDS showed good performance even with small sample sizes. This implies that the embedded space obtained from a small sample set does not differ much from that obtained from the whole data set. As for LE, we could not cut down the samples very much in the experiment, because LE depends on the neighborhood relations to compute its similarity matrix (see (5)).

Incidentally, LE is one of the nonlinear dimensionality reduction methods known as manifold learning. Manifold learning is known to be more effective than linear methods such as PCA when the data lie on a curved manifold in a high dimensional space. However, as far as our MNIST and ASL data are concerned, no such effect was observed in the experiment.

6. Conclusions

We have considered how to speed up similarity search of vector valued time series data when the dissimilarity is defined in terms of DTW distances. We have taken the approach of embedding the time series in a low dimensional Euclidean space and performing a multidimensional search. In order to speed up the embedding process, we have proposed to use the Nystrom method. Thanks to the Nystrom method, DTW distances to only a small number of samples are sufficient to embed the data with high accuracy. Let the length of the time series be l, the number of time series in the DB be n, and the number of samples be m. The complexity of the proposed method for each query is O(ml²), while the complexity of linear search is O(nl²).

We have considered MDS and the Laplacian Eigenmap as candidate embedding methods. We have applied the Nystrom method to these embeddings, and evaluated the performance in an experiment using a large scale DB of time series data. In the experiment, the MDS embedding with the Nystrom approximation showed good performance even with a small number of samples.

References

[1] C. Baker, "The Numerical Treatment of Integral Equations", Oxford University Press, 1977.

[2] M. Belkin and P. Niyogi, "Laplacian Eigenmaps for Dimensionality Reduction and Data Representation", Neural Computation, 15(6), pp. 1373-1396, 2003.

[3] Y. Bengio et al., "Learning Eigenfunctions Links Spectral Embedding and Kernel PCA", Neural Computation, 16(10), pp. 2197-2219, 2004.

[4] C. Faloutsos and K. Lin, "FastMap: A Fast Algorithm for Indexing, Data-Mining and Visualization of Traditional and Multimedia Datasets", Proc. ACM SIGMOD, pp. 163-174, 1995.

[5] C. Fowlkes, S. Belongie, F. Chung, and J. Malik, "Spectral Grouping Using the Nystrom Method", IEEE Trans. PAMI, 26(2), pp. 214-225, 2004.

[6] A. Gionis, P. Indyk, and R. Motwani, "Similarity Search in High Dimensions via Hashing", Proc. Int. Conf. Very Large Databases, pp. 518-529, 1999.

[7] A. Hayashi, Y. Mizuhara, and N. Suematsu, "Embedding Time Series Data for Classification", Int. Conf. Machine Learning and Data Mining (MLDM 2005), pp. 356-365, 2005.

[8] S. Hettich and S. D. Bay, UCI Repository of KDD Databases, http://kdd.ics.uci.edu/, 1999.

[9] W. Kadous, Australian Sign Language Data in the UCI KDD Archive, http://www.cse.unsw.edu.au/~waleed/tml/data/.

[10] E. Keogh, "Exact Indexing of Dynamic Time Warping", Proc. VLDB, pp. 406-417, 2002.

[11] E. Keogh and S. Kasetty, "On the Need for Time Series Data Mining Benchmarks: A Survey and Empirical Demonstration", Proc. 8th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 102-111, 2002.

[12] "THE MNIST DATABASE of handwritten digits", http://yann.lecun.com/exdb/mnist/.

[13] W. Press et al., "Numerical Recipes in C: The Art of Scientific Computing", Cambridge University Press, 2nd edition, 1992.

[14] L. Rabiner and B. Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993.

[15] H. Samet, "The Design and Analysis of Spatial Data Structures", Addison-Wesley, 1989.

[16] W. S. Torgerson, "Theory and Methods of Scaling", J. Wiley & Sons, 1958.

[17] C. Williams, "On a Connection between Kernel PCA and Metric Multidimensional Scaling", Advances in Neural Information Processing Systems 13, pp. 675-681, 2001.

[18] C. Williams and M. Seeger, "Using the Nystrom Method to Speed Up Kernel Machines", Advances in Neural Information Processing Systems 13, pp. 682-688, 2001.

[19] B. Yi, H. Jagadish, and C. Faloutsos, "Efficient Retrieval of Similar Time Sequences Under Time Warping", Proc. ICDE, pp. 201-208, 1998.

A. DTW distance

Let ‖ · ‖ be the Euclidean norm.

1. Initialize: g(0, 0) = 0.

2. Repeat: for 1 ≤ t_i ≤ l_i, 1 ≤ t_j ≤ l_j,

g(t_i, t_j) = min { g(t_i − 1, t_j) + ‖x^i_{ti} − x^j_{tj}‖²,  g(t_i − 1, t_j − 1) + 2‖x^i_{ti} − x^j_{tj}‖²,  g(t_i, t_j − 1) + ‖x^i_{ti} − x^j_{tj}‖² }

3. Finish: d²(X_i, X_j) = g(l_i, l_j) / (l_i + l_j)
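A direct sketch of the recursion above; the boundary convention g(i, 0) = g(0, j) = infinity for i, j > 0 is a standard assumption we add, since the appendix only states g(0, 0) = 0.

import numpy as np

def dtw2(Xi, Xj):
    """Squared, length-normalized DTW distance d^2(Xi, Xj) between two
    sequences of feature vectors (arrays of shape (l_i, dim) and (l_j, dim))."""
    li, lj = len(Xi), len(Xj)
    g = np.full((li + 1, lj + 1), np.inf)
    g[0, 0] = 0.0
    for ti in range(1, li + 1):
        for tj in range(1, lj + 1):
            cost = np.sum((Xi[ti - 1] - Xj[tj - 1]) ** 2)    # ||x^i_ti - x^j_tj||^2
            g[ti, tj] = min(g[ti - 1, tj] + cost,            # vertical step
                            g[ti - 1, tj - 1] + 2 * cost,    # diagonal step (weight 2)
                            g[ti, tj - 1] + cost)            # horizontal step
    return g[li, lj] / (li + lj)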

B. Modified K-means

1. Select k data points at random and make them the centers of the initial clusters.

2. Assign each data point to the cluster whose center is nearest.

3. Reassign a center for each cluster as follows: for each data point in the cluster, compute the total sum of the distances to all the data points in the cluster, and let the one with the smallest total be the new center.

4. If there is no change in the centers, finish; else go to step 2.
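A sketch of this distance-only K-means (essentially a K-medoids-style procedure) over a precomputed n x n DTW distance matrix; the iteration cap and the handling of empty clusters are safeguards we add for illustration.

import numpy as np

def modified_kmeans(D, k, rng=None, max_iter=100):
    """Select k sample (center) indices using only the pairwise distance
    matrix D, following steps 1-4 of Appendix B."""
    rng = np.random.default_rng() if rng is None else rng
    n = D.shape[0]
    centers = rng.choice(n, size=k, replace=False)          # step 1
    for _ in range(max_iter):
        labels = np.argmin(D[:, centers], axis=1)            # step 2
        new_centers = centers.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            if len(members) == 0:
                continue                                     # keep old center
            within = D[np.ix_(members, members)].sum(axis=1)
            new_centers[c] = members[np.argmin(within)]      # step 3
        if np.array_equal(new_centers, centers):             # step 4
            break
        centers = new_centers
    return centers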


Identifying Temporal Patterns and Key Players in Document Collections

Benyah Shaparenko, Rich Caruana, Johannes Gehrke, Thorsten Joachims

Department of Computer Science, Cornell University, Ithaca, NY 14853

benyah, caruana, johannes, [email protected]

Abstract

This paper considers the problem of analyzing the development of a document collection over time without requiring meaningful citation data. Given a collection of time-stamped documents, we formulate and explore the following two questions. First, what are the main topics and how do these topics develop over time? Second, to gain insight into the dynamics driving this development, what are the documents and who are the authors that are most influential in this process? Unlike prior work in citation analysis, we propose methods addressing these questions without requiring the availability of citation data. The methods use only the text of the documents as input. Consequently, they are applicable to a much wider range of document collections (email, blogs, etc.), most of which lack meaningful citation data. We evaluate our methods on the proceedings of the Neural Information Processing Systems (NIPS) conference. Even with the preliminary methods that we implemented, the results show that the methods are effective and that addressing the questions based on the text alone is feasible. In fact, the text-based methods sometimes even identify influential papers that are missed by citation analysis.

1. Introduction

Many document collections have grown through an interactive and time-dependent process. Earlier documents shaped documents that followed later, with some documents introducing new ideas that lay the foundation for following documents. Examples of such collections are email repositories, the body of scientific literature, and the web. To access and analyze such collections, it is important to understand how they developed. For example, consider a historian trying to get an understanding of the ideas and forces leading to the Iraq war from news articles. Or, consider the head of a hiring committee trying to understand which scientists had the greatest influence on the development of a discipline.

In this paper, we pose and consider the problem of analyzing the temporal development of a document collection. This problem requires simultaneously understanding what topics are popular and which documents and authors drive the changes in popularity of the topics. In particular, we address the following questions:

• What are the key topics in a collection of documents and how did their popularity change over time?

• Which documents introduced new ideas that had large impact?

• Who were the authors that significantly drove the evolution of ideas?

To answer these questions for general document collections, we require that our algorithms work without meta-data augmenting the document representation. In particular, since most collections lack meaningful citation and hyperlink structure, the analysis must be done based entirely on the text of the documents.

Most existing work related to these questions has focused on exploiting meta-data like hyperlinks and citation information. Graph-based algorithms like HITS [9], PageRank [15], and their descendants (see e.g. [3]) exploit information in the hyperlink structure to find outstanding documents. These algorithms are based on citation-analysis methods from bibliometrics (see e.g. [12]) that are used to detect related work and define impact [4, 5]. In contrast to using citation data, we propose complementary methods that use solely the text of the documents, making them applicable beyond scientific literature and the web. To the best of our knowledge, there is no existing method that uses only the text of the documents to determine the most influential documents or authors.

For the problem of discovering topics and trends in a collection of documents, however, there is already quite a body of work. The TDT evaluations (see e.g. [1, 2]) emphasized online new-topic detection for news articles. Other work has focused on burst detection, correlating real-world events such as the rise and fall of a topic's popularity with single words from the documents [20, 10]. Evolutionary theme patterns demonstrate the entire "life cycle" of a topic from a probabilistic background [13]. Other recent work presents efficient algorithms specially designed for thread detection [6]. We build upon this work for visualizing the development of topics over time.

The main contribution of our work is the definition of an interesting research problem, namely, how to identify a document collection's most influential documents and authors using only the text of the documents. We present one such method and show that this problem is in fact feasible and that even simple methods lead to interesting results. In an empirical evaluation on a collection of scientific articles, our method was able to identify influential documents and authors successfully. In particular, we compare the results with citation counts and find that the new identification methods find papers with new and influential ideas even in some cases where citation analysis fails. By using the information of which author wrote which document, we can also determine who the most influential authors of this document collection are.

The paper is structured around the three questions from above, which we address in turn. After describing the related work in more detail in Section 2, we introduce in Section 3 the data that is used as the testbed. In Section 4 we present the clusters/topics visualization, and Sections 5 and 6 present our methods for identifying influential documents and authors, respectively.

2. Related Work

Our work identifies influential documents and authors and provides a way of visualizing the topical development of a document collection. There is some work on identifying influential documents and authors; however, previous work uses citations, not simply the text. There is also much work on identifying trends in document collections. We review the related work in the following.

2.1. Influential Documents and Authors

Since our main goal is the identification of key documents and authors, the most closely related work is in the fields of bibliometrics and citation analysis. Work in bibliometrics (e.g. [12]) uses citation analysis on a set of research papers to determine the most influential authors and papers. It finds that the number of citations is the best predictor of a paper's influence. Other bibliometric work has also considered the issue of how to find leading documents and authors [14, 22]. Leading documents and authors can be found by analyzing the citation graph.

McGovern et al. use the hubs and authorities algorithm [9] to identify authoritative documents, and then define authoritative authors as the authors who write several papers among the most authoritative ones. Like this previous work, we seek to identify the authoritative authors and documents. However, our work is more general since we use only the text of the documents instead of bibliographic information. Consequently, we can also handle domains such as news articles, where there are no formal citations, and still successfully find the leading documents and authors.

For finding leading documents in a hyperlinked environment, the classic algorithm is PageRank [15]. Used by Google, PageRank finds the most influential documents by considering the reputations of the documents in the collection and which documents link to which other documents. A document's reputation is raised (or lowered) based on the number and reputation of the documents citing it. More documents linking to a document means that document enjoys greater popularity; reputable documents linking to a document mean that document should also be reputable.

Besides looking at the impact of individual documents or authors, citation analysis has also tackled the problem of finding the journals with the most impact [4, 5]. Though not without controversy, the impact factor uses citations to measure how important the articles within a journal are on average. In general, more citations means greater popularity. However, because it accounts for variables such as journal size and shifts in journal popularity over time, the impact factor gives a more accurate measure of the influence of a particular journal's papers than raw citation counts. This vein of work is similar to ours because the problem formulation presented in our work generalizes to groups of documents and authors, not just individual documents and authors. For example, one could think of ranking universities by their influence in a research community. Instead of using citation analysis, we could use the text produced by the research groups at these universities to rank the universities.

2.2. Temporal Topic/Trend Detection

Besides finding leading documents and authors, we additionally present a visualization of the topics in a document collection. Related work in this area starts with early work on new-topic detection. The TDT studies [1, 2] investigated online new-topic detection for news articles. In some sense, the online version of the problem is harder than the one we consider: we assume that we already have all the documents in the collection with time stamps and that they can be processed offline. Although our work and the TDT work both have topic detection as a goal, another difference is that TDT focused on detecting the arrival of new topics, while our work focuses on providing an overview of how the topical foci of a document collection change over time.

Independent component analysis (ICA) is another method that can be used for purposes similar to the TDT task. For example, ICA has been used to distinguish topics in CNN news chat room logs [11]. Like our work, this usage of ICA is unsupervised and relies only on the text. After performing principal component analysis, the ICA algorithm distinguishes the main topics. That work graphs the existence of these topics over time, but it is hard to gauge the relative strength of the topics. Our work shows how topics rise and decrease in strength over time.

Our work bears more similarity to burst detection [10] and timeline creation [20]. Both burst detection and timeline creation seek to correlate real-world events with the text used in the document collections. There is an implicit assumption that as real-world events change, the text used in the documents will change as well. Words that nobody used at one time may become widely popular, e.g. words describing a new technology or new idea. When burst detection is run on a set of research papers, the bursts seem to correspond to the rise and fall in popularity of research topics. By using a state machine approach, bursts can be detected in anything from email, to Presidential State of the Union addresses, to research papers. These wide-ranging applications are possible because, similar to our work, burst detection assumes only time-stamped text documents.

As in burst detection, recent work on thread detection has proposed efficient, formal models [6]. These models do not depend on a flat clustering or time-stamped documents, but instead focus on using time in the algorithm. For example, instead of calculating all pairwise document similarities, this thread detection work only considers two documents similar if they both contain the same term and occur within a set time window of each other. This work therefore does not suffer from one problem of flat clustering, that of emphasizing cluster coherence at the loss of identifying developing and changing strands of topics. Even though using an algorithm specifically designed for temporal clustering may provide better clusters, our emphasis is on visualizing the topics, not on the actual clustering method, so we just use a simple flat clustering.

Another interesting direction of previous work that deals with developing strands of research is detecting evolutionary theme patterns. It differs from burst detection and our work in that evolutionary theme patterns emphasize displaying the entire "life cycle" of a theme [13], while burst detection and our work simply consider a flat version of clustering. Detecting evolutionary theme patterns not only detects when a topic develops and fades, but also what future or other topics may have been influenced by this topic. In some sense, the theme evolution graphs present ways of depicting flows of ideas in the document collection throughout time.

3. Data and Testbed

Before presenting the methods addressing the three questions from above, we first discuss the type of data we are considering. We assume that the collection consists of documents where:

• the text of the documents is accessible,

• the documents are time-stamped and assumed to arrive in (or can be grouped into) batches,

• and there are dependencies between earlier and later documents.

Examples of such collections are email, proceedings of scientific conferences, scientific journals, news, and blogs.

As a testbed, we chose a collection of scientific articles, in particular the articles published in the proceedings of the Neural Information Processing Systems (NIPS) conference [8] between 1987 and 2000. The reason for choosing this data set is threefold. First, we believe that scientific document collections fulfill the assumptions stated above. Second, for scientific articles citation data is available, so we can compare our methods against citation counts. And third, we are familiar with the development of this scientific community, which allows us to evaluate the performance of the algorithms as informed insiders.

Since we consider fourteen years of research papers, we expect to see several strong trends as topics develop and change over time. This set of full-text documents was obtained by OCR. There are a total of 1955 documents, with approximately 100 documents in the first two years and then 150-160 documents each year in the last twelve years. We use only the text (not the citation or bibliographic information) from these documents. As meta-data, we use only the time-stamps (year) of the documents and the extracted author names of each article.

As our text-based representation, we chose a standard vector-space approach [18]. In particular, we convert the text documents to a standard TFIDF ("ltc") representation [19]. In this representation, the features are words from the available text. We ignore stopwords and words that occur only once, but consider all other words as features. No stemming is used. To build a TFIDF vector for each document, we count the number of times term t appears in the document and then multiply by the IDF weighting factor log(n/n_t), where n is the number of documents in the corpus and n_t is the number of documents that contain the term t. To determine the similarity of two documents, we then use the standard cosine similarity between the TFIDF vectors.
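A rough sketch of such a representation, assuming pre-tokenized documents; the exact "ltc" sub-weighting and preprocessing used in the paper may differ from this simplified tf x log(n/n_t) version, and the rare-word filter here uses document frequency as an approximation.

import math
from collections import Counter

def tfidf_vectors(docs, stopwords=frozenset()):
    """docs: list of token lists. Returns a list of {term: weight} dicts
    using tf * log(n / n_t) weighting, cosine-normalized."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))
    vocab = {t for t, c in df.items() if c > 1 and t not in stopwords}
    vectors = []
    for doc in docs:
        tf = Counter(t for t in doc if t in vocab)
        vec = {t: c * math.log(n / df[t]) for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors

def cosine(u, v):
    """Cosine similarity of two normalized sparse vectors."""
    return sum(u[t] * v[t] for t in set(u) & set(v))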


Cluster Descriptions:
0: training, error, generalization, margin, hidden
1: spike, cells, neurons, cell, firing
2: image, images, object, objects, recognition
3: policy, reinforcement, state, controller, action
4: speech, word, hmm, recognition, speaker
5: chip, circuit, analog, voltage, vlsi
6: bayesian, mixture, gaussian, posterior, likelihood

[Figure 1. Clusters proceed from cluster 0 on the bottom of the graph to cluster 6 on the top. (left) The distribution of the k = 7 clusters. The histograms of each cluster are stacked on top of each other to show the effects of cluster popularity over time. (right) The percentage distribution of the k = 7 clusters. In this case, we normalize the histograms by the number of documents per year.]

4. How do key topics change over time?

The first problem we consider is that of visualizing the key topics of a document collection and how the popularity of these topics develops over time. The goal is to provide a concise summary of the high-level development of topics even for large-scale document collections that are too expensive to analyze manually. Following the flavor of ideas from ThemeRiver [7], we summarize the development of topics using "Temporal Cluster Histograms."

4.1. Method

Our method proceeds in three steps. In the first step, we determine the key topics in the document collection via clustering; each cluster represents a key topic. In the second step, a concise description of the key topic of each cluster is formed. In the final step, we visualize the temporal behavior of the topics as a flow through time, indicating increasing or decreasing popularity.

As the clustering algorithm in the first step we use k-means, in particular Weka's [17] implementation. We modified Weka for this application so that cosine distance could be used for k-means clustering. Since k-means may get stuck in local optima, for each value of k we chose 10 random seeds and selected the clustering that had the least squared error.

To describe each cluster's topic, we extract the five words with the highest weights in the cluster's centroid. These five words are the most important terms in defining the cluster centroid. The number five is somewhat arbitrary, but was chosen because we found that five words are sufficient to convey a good sense of the cluster's content without presenting an overwhelming amount of information. Using the top five words allows us to reliably identify important terms describing the topic of a cluster.

Finally, we plot how topic popularity varies over time. For each year, we compute the number of documents that fall into each cluster and plot each cluster's yearly breakdown as a stacked histogram. Using stacked histograms clearly presents the changes in cluster size over time as a flow. Note that while the k-means clustering does not take time into account when clustering the documents, this last step relates the clusters to time.
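A rough sketch of the three steps using off-the-shelf tools; scikit-learn's Euclidean k-means on L2-normalized TF-IDF vectors is used here as a common stand-in for the paper's cosine-distance k-means in Weka, and matplotlib draws the stacked histogram. All names are illustrative.

import numpy as np
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt

def temporal_cluster_histogram(X, years, k=13, top_words=5, vocab=None):
    """X: (n_docs x n_terms) TF-IDF matrix, years: length-n_docs array.
    Clusters the documents, prints top centroid terms per cluster, and
    plots per-year cluster counts as a stacked bar chart."""
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    # 10 random seeds, keep the clustering with the least squared error
    best = min((KMeans(n_clusters=k, n_init=1, random_state=s).fit(Xn)
                for s in range(10)),
               key=lambda km: km.inertia_)
    if vocab is not None:
        for c, center in enumerate(best.cluster_centers_):
            top = np.argsort(center)[::-1][:top_words]
            print(c, [vocab[i] for i in top])
    year_range = np.arange(years.min(), years.max() + 1)
    counts = np.array([[np.sum((best.labels_ == c) & (years == y))
                        for y in year_range] for c in range(k)])
    bottom = np.zeros(len(year_range))
    for c in range(k):
        plt.bar(year_range, counts[c], bottom=bottom, label=str(c))
        bottom += counts[c]
    plt.xlabel("year"); plt.ylabel("documents"); plt.legend()
    plt.show()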

4.2. Results

Cluster Descriptions:
0: speech, word, hmm, recognition, mlp
1: recurrent, hidden, training, units, error
2: image, images, object, face, video
3: code, codes, decoding, message, hints
4: units, node, training, nodes, tree
5: visual, eye, cells, motion, orientation
6: policy, reinforcement, action, state, agent
7: david, michael, john, richard, chair
8: neurons, neuron, synaptic, memory, firing
9: spike, spikes, firing, neuron, neurons
10: bayesian, mixture, posterior, likelihood, em
11: kernel, margin, svm, vc, xi
12: chip, circuit, analog, voltage, vlsi

[Figure 2. Clusters proceed from cluster 0 on the bottom of the graph to cluster 12 on the top. (left) The distribution of the k = 13 clusters. The histograms of each cluster are stacked on top of each other to show the effects of cluster popularity over time. (right) The percentage distribution of the k = 13 clusters. In this case, we normalize the histograms by the number of documents per year.]

Figure 2 shows the results of the method as applied to the NIPS data for k = 13. Most clusters directly represent topics and reflect our knowledge of the NIPS community very well. In particular, clusters 10 and 11 clearly show the two emerging research areas in NIPS, namely "Bayesian Methods" and "Kernel Methods" like Support Vector Machines (SVMs). The graph correctly indicates that the topic of Bayesian analysis started before the kernel methods cluster, with both topics starting to dominate the NIPS conference in 2000. Also, it correctly indicates that the Kernel Methods topic strongly gained in popularity at that time. On the other hand, cluster 4 on supervised neural network training (e.g. feedforward neural networks), cluster 1 on recurrent neural networks, and cluster 8 on biologically-inspired neural memories were very strong in the early years of NIPS, but by 2000 had almost disappeared from the conference. This phenomenon also agrees with our prior perception of the NIPS conference.

The only cluster that does not represent a topic is cluster 7. This cluster groups together the outliers in the collection, which are not scientific papers but other types of documents. In particular, cluster 7 contains author indexes, subject indexes, the NIPS introductory page, and the start of the proceedings. In this case, the clustering helps clean the data and identify outliers that do not fit any "content" topic classes.

Our method of extracting keywords from the cluster centroids works reasonably well; many of the words are highly informative for the cluster content. The top five words shown give a reasonable description of the main topics in the NIPS conference.

Figures 1, 2, and 3 show the results for all values of k that we used, namely k = 7, 13, and 30, respectively. For the clusterings with more or fewer clusters, topics get merged and split in a reasonable fashion. Interestingly, the emerging clusters on Bayesian Methods and Kernel Methods are rather homogeneous and do not get split even for large numbers of clusters. In Figure 3, with 30 clusters, these two areas are still well-defined and seem to show behavior similar to that in the depiction with 13 clusters. These two clusters are very strong: even when there are only 7 clusters, as in Figure 1, these two topics still stand out among all the rest (even though the Kernel Methods and Neural Nets clusters have been combined by the clustering algorithm).

Overall, we believe that the cluster analysis and its visualization correctly reflect the development of the NIPS conference.

5. Which are the most influential documents?

Now that we have a way of visualizing clusters and determining how the topics developed over time, we would like to identify the proponents driving these changes. At the first level, these proponents are documents and the ideas they convey. Determining which documents are most influential on later work gives insight into the ideas driving the changes in the document collection over time. While some of this influence is conveyed through citations in the area of scientific literature, the goal is a general solution that will work for any text document. Consequently, we restrict our methods to using only the text of the documents, but use citation data to evaluate the quality of our methods.


Cluster Descriptions:
0: mlp, hmm, speech, ensemble, rbf
1: recurrent, state, units, hidden, network
2: video, tracking, audio, image, camera
3: code, codes, decoding, hint, hints
4: units, hidden, classifier, training, unit
5: motion, visual, velocity, orientation, direction
6: policy, reinforcement, agent, action, state
7: david, michael, john, richard, chair
8: memory, capacity, synaptic, associative, memories
9: routing, rod, bipolar, router, game
10: bayesian, gaussian, posterior, mixture, likelihood
11: kernel, margin, svm, kernels, adaboost
12: motor, eye, movement, movements, visual
13: image, images, face, texture, wavelet
14: word, speech, speaker, recognition, words
15: ica, blind, separation, sources, eeg
16: tangent, td, distance, prototypes, simard
17: option, policy, portfolio, call, traffic
18: trajectory, units, hidden, weights, training
19: vor, head, vestibular, eye, velocity
20: chip, circuit, analog, voltage, vlsi
21: obs, gradient, convergence, momentum, obd
22: robot, controller, control, reinforcement, critic
23: cells, cell, cortical, cortex, neurons
24: object, tree, node, nodes, objects
25: spike, firing, spikes, neuron, neurons
26: student, teacher, dynamics, replica, spin
27: theorem, vc, bounds, bound, ¡
28: clustering, cluster, clusters, som, codebook
29: auditory, sound, cochlear, speech, frequency

[Figure 3. Clusters proceed from cluster 0 on the bottom of the graph to cluster 29 on the top. (left) The distribution of the k = 30 clusters. The histograms of each cluster are stacked on top of each other to show the effects of cluster popularity over time. (right) The percentage distribution of the k = 30 clusters. In this case, we normalize the histograms by the number of documents per year.]

5.1. Method

We define the impact of a document as the amount of followup work it generates. As a measure of the influence of a paper on later work, we propose a lead/lag index. It is based on the assumption that "imitation is the highest form of flattery," i.e. if one document spawns a great deal of followup work that uses similar vocabulary, then that document was very influential. In particular, the lead/lag index measures whether a document is more of a leader or more of a follower. We assume that leaders have many papers following them, and vice versa. The general idea is illustrated in Figure 4. More formally, the index is defined as follows.

For each document d, we find the k nearest neighbors knn(d) in terms of the cosine distance between TFIDF vectors. We then count the number of neighbors that are published later than d,

k_later = |{d′ | d′ ∈ knn(d) ∧ time(d′) > time(d)}|,

and the number of neighbors that precede the paper,

k_earlier = |{d′ | d′ ∈ knn(d) ∧ time(d′) < time(d)}|.

By comparing these two numbers, it is possible to determine the degree to which a paper builds upon influential ideas vs. proposing new ideas that have influence on later documents.

The raw lead/lag index of a document d is computed by subtracting the number k_earlier of papers preceding the current paper in time from the number k_later of papers following it in time:

I^raw_d = k_later − k_earlier

However, the index is strongly affected by edge effects. Forexample, kearlier is guaranteed to be zero for documentsfrom the first time step. To avoid such biases, we scale eachyear’s documents by normalizing it across all papers fromthe same time step. In particular, we subtract the average of

TDM 2005: 2005 Temporal Data Mining Workshop 170

Page 171: Temporal data mining: algorithms, theory and applications (TDM …cs.stmarys.ca/~pawan/icdm05/proceedings/TDM-proce… ·  · 2005-11-07Temporal data mining: algorithms, theory and

Rank   Year  Citations   Paper Title and Author(s)
1.167  1996  128         "improving the accuracy and speed of support vector machines"
                         chris j.c. burges, b. scholkopf
1.128  1999  17 (466)    "using analytic qp and sparseness to speed training of support vector machines"
                         john c. platt
0.986  1999  18          "regularizing adaboost"
                         gunnar ratsch, takashi onoda, klaus-robert muller
0.953  1996  41 (3711)   "support vector method for function approximation, regression estimation, and signal processing"
                         vladimir vapnik, steven e. golowich, alex smola
0.945  1998  27          "training methods for adaptive boosting of neural networks"
                         holger schwenk, yoshua bengio
0.945  1997  3           "modeling complex cells in an awake macaque during natural image viewing"
                         william e. vinje, jack l. gallant
0.934  1998  17          "em optimization of latent-variable density models"
                         c. m. bishop, m. svensen, c. k. i. williams
0.934  1995  584         "a new learning algorithm for blind signal separation"
                         s. amari, a. cichocki, h. h. yang
0.934  1995  16          "fast learning by bounding likelihoods in sigmoid type belief networks"
                         t. jaakkola, l. k. saul, m. i. jordan
0.914  1998  49          "dynamically adapting kernels in support vector machines"
                         nello cristianini, colin campbell, john shawe-taylor
0.914  1999  27          "approximate learning of dynamic models"
                         xavier boyen, daphne koller

Figure 5. Based on the lead/lag index, above is a list of the most influential NIPS papers when considering each paper's k = 14 nearest neighbors. According to our algorithm, these influential papers inspire the most followup work. We also provide the year of publication and the number of citations the papers received according to Google Scholar. Numbers in parentheses signify that there is a related publication by the same author(s) with similar content that receives most of the citations.

the raw lead/lag indices for a year from each raw lead/lag index in that year.

I_scaled(d) = (1/k) · ( I_raw(d) − (1 / |{d_i : time(d_i) = time(d)}|) · Σ_{d_i : time(d_i) = time(d)} I_raw(d_i) )

The resulting scaled lead/lag index corrects for such edge effects. The higher the scaled lead/lag index, the more influential the paper. Note that the scaled lead/lag index is also normalized with respect to k. This scaling process makes values from different choices of k comparable. The scaled lead/lag index scores typically fall in the interval from -1 to +1, with extremely strong papers receiving scores slightly above +1 and extremely lagging papers receiving scores slightly below -1.
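As a rough, hypothetical continuation of the sketch above (again not the authors' code), the normalization step amounts to centering the raw indices within each publication year and dividing by k, assuming `raw` is the array of raw indices and `years` the publication years:

```python
def scaled_lead_lag(raw, years, k=14):
    """Center the raw indices within each publication year, then normalize by k."""
    years = np.asarray(years)
    scaled = np.zeros(len(raw))
    for y in np.unique(years):
        same_year = (years == y)
        scaled[same_year] = (raw[same_year] - raw[same_year].mean()) / k  # subtract the year's average
    return scaled
```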

5.2. Results

We computed the scaled lead/lag index for the NIPS data set. The value of k is the only parameter that needs to be selected. With a small k, only the closest documents to a particular document are considered. If a paper is very influential, then other documents influenced by that paper are missed in this analysis. On the other end of the spectrum, if k is too large, documents that are only marginally affected by a particular paper are included in the ranking. This can lead to noisier results. We ran the experiments for k = 7, 14, 24, and 49.
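A hypothetical driver for these experiments, reusing the two sketches above, could simply loop over the candidate values of k and print the top-ranked documents for each:

```python
for k in (7, 14, 24, 49):
    raw = raw_lead_lag(docs, years, k=k)
    scaled = scaled_lead_lag(raw, years, k=k)
    top10 = np.argsort(scaled)[::-1][:10]   # indices of the ten highest-scoring papers
    print(f"k={k}: top documents {top10.tolist()}")
```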

Figure 5 shows the results on the NIPS data for k = 14. Different values of k agreed more or less on which documents are most influential, with only small changes in the ordering among the top scoring papers. This indicates that the method is robust with respect to the choice of k, and that most reasonable values of k produce comparable results.

The list of the most leading NIPS papers computed by our algorithm closely reflects our insider perception of the NIPS conference. First, among the highest ranked papers are those presenting central new ideas on support vector machines, which, as is also evident from the Temporal Cluster Histograms, have had an outstanding influence on the


[Figure 4: schematic plot of documents arranged by content (vertical axis) over time (horizontal axis).]

Figure 4. The main idea of the lead/lag index is to decide whether a paper is more of a leader or follower based on whether similar papers, content-wise, follow or precede the paper in question. In this figure, graphed items represent documents, where the open circles are of one topic and the black dots are of another topic. The open circles (and black dots) towards the left of their respective clusters are leaders because similar documents follow these documents. On the other hand, the circles (and dots) towards the right are followers because documents with similar content precede these points.

development of the NIPS community. The first two SVM papers published in NIPS are ranked first and fourth in our algorithm's ranking, receiving recognition for being first in one of NIPS' hottest topics.

Second, the results from our method are different from what citation analysis would produce. The number of citations of each paper according to Google Scholar is given in the third column of Figure 5. (These counts measure the impact throughout all venues, whereas our ranking measures the impact within the NIPS community.) Our method reflects the importance of ideas presented in the paper, not how often this paper was cited. For example, the publication by John Platt was the first refereed paper to propose the SMO algorithm for support vector machine training, which has become one of the standard methods for this problem. However, this paper has only 17 citations in Google Scholar, since most authors cite a book chapter with similar content and 466 citations. Due to this, citation analysis would not have recognized the importance and influence of the ideas in this paper. Another example is Vapnik's paper, which ranks fourth in this algorithm's ranking. Although the NIPS paper has just 41 citations, other work by Vapnik on support

vector machines has many more citations (e.g. Vapnik's first book has 3711 citations).

Third, while many of the papers in Figure 5 are in the area of support vector machines, this dominant topic does not drown out influential ideas in other topic areas. An example is the paper by Amari et al., which is a fundamental paper for the topic of independent component analysis with 449 citations in Google Scholar. This paper is not only fundamental for independent component analysis, but as it turns out, it is also one of the most influential NIPS papers overall. In fact, this paper is the second most often cited of all NIPS papers.

In summary, the list of most influential papers agrees well with our opinion of the most influential ideas in the NIPS corpus. These results validate our assumption that measuring textual similarity provides an adequate method for determining which papers have influence on later papers in the document collection.

6. Who are the most influential authors?

Since papers do not write themselves, once we can determine the most leading documents, the next logical step is to ask who wrote them. Given a collection of documents, we would like to answer the questions of which authors produce the most original work, which authors are most influential in spreading their ideas, and which authors determine the pulse of the field and future directions of research.

6.1. Method

The document lead/lag index already provides a method for determining the influence of a document. To identify the most influential authors in the document collection, we can aggregate the document lead/lag information by author. Specifically, we address the following question: Which authors write documents that have a significantly high scaled lead/lag index?

To aggregate the lead/lag index scores by author, we compute the 95% confidence interval around the average lead/lag score for each author. We then rank the authors by the lower 95% confidence bound. More specifically, consider an author with n papers receiving scaled lead/lag scores I_scaled(d_1), ..., I_scaled(d_n). For these scores, one can compute the confidence interval for the mean m from the sample variance v under the assumption of normality as m ± 2·√(v/n). However, this confidence interval is quite sensitive to anomalies for small samples. For example, one author may have two papers with medium rank and identical scores. Then, the author will receive an excellent score because the variance estimate is zero. To smooth the variance estimate and reduce this problem, we add an extra document with a score of −1 (a score which is near the bottom of the


lead/lag rankings) to the author's list of documents. With the new mean m′ and variance v′, the lead/lag index of an author is

I_a = m′ − 2·√( v′ / (n + 1) ).   (1)
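A minimal sketch of this aggregation, again not the authors' implementation, assumes a hypothetical mapping `author_scores` from each author name to the list of scaled lead/lag scores of that author's papers:

```python
def author_lead_lag(author_scores):
    """Rank authors by the smoothed lower confidence bound I_a = m' - 2*sqrt(v'/(n+1)).

    A pseudo-document with score -1.0 is appended to each author's list to smooth
    the variance estimate, as described in the text.
    """
    ranking = {}
    for author, scores in author_scores.items():
        n = len(scores)
        padded = np.append(np.asarray(scores, dtype=float), -1.0)   # extra smoothing document
        m_new, v_new = padded.mean(), padded.var(ddof=1)             # new mean m' and variance v'
        ranking[author] = m_new - 2.0 * np.sqrt(v_new / (n + 1))
    return sorted(ranking.items(), key=lambda item: item[1], reverse=True)
```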

6.2. Results

We computed the author lead/lag index for all authors in the NIPS collection. Figure 6 has the results for k = 14. Results for different values of k are similar, so we only present k = 14. Overall, we find that this ranking for the most part identifies a document collection's key players.

From bibliometrics, we know that typically the best predictor of an author's importance is the number of citations that author receives [12]. Therefore, we compare our aggregated author lead/lag ranking to the number of citations an author has received on Google Scholar. (We searched by the author's name and added citation counts for the first 200 documents by that author, or as many documents as had citations.) Note, however, that the citation counts measure the impact of an author in all venues, while our lead/lag index measures the impact in NIPS. Additionally, we present numbers for how prolific an author is, measured by the number of papers the author has published in NIPS. The author with the most papers is Terrence Sejnowski, who has 46. We find that highly-cited authors typically rank high in the author lead/lag index. For example, Michael Jordan has often published influential work, and the algorithm recognizes this by ranking him at the top of the list. Of the 1931 authors in NIPS, the 20 most influential authors according to the author lead/lag index (Figure 6) in general have a significant number of publications. Authors that are lower down in the ranking do not have nearly this number of publications. The authors that the aggregated lead/lag index identified for the most part represent well-known, leading names in the NIPS community. Therefore, we believe that aggregating the lead/lag index by author leads to a meaningful ranking of an author's influence on following work.

By and large, the algorithm works quite well in identifying key, influential authors. There are just 2 cases out of the top 20 where the authors do not have many citations. As it turns out, in both cases, the reason is an artifact of the data used, not our method of computing the author lead/lag index. Since the NIPS data set is obtained by OCR, we used an automated string match process to match the names (same first initial and last name within edit distance of 2). For both “D. D. Coon” and “Harrison Monfook Leong,” the above name match heuristic combines many names. The name “D. D. Coon” here in fact represents many authors with short names. Similarly, the name “Leong” has many similar names in the NIPS data. With a perfect list of who authored which documents, this phenomenon would disappear. Therefore, we conclude that our method works as expected, producing a list of well-known, well-published authors.

7. Summary

We propose the problem of analyzing the temporal development of document collections for which there is no meaningful citation data available. As proof of concept, we propose simple methods that show that this problem is feasible and interesting. Unlike existing approaches from bibliometrics, the new methods are applicable even if no citation or hyperlink data is available. Using the proceedings of the NIPS conference as a testbed, Temporal Cluster Histograms were found to give an accurate and concise summary of the popularity of topics over time. To identify the papers with the largest influence on topic development, we defined a document lead/lag index that is an effective indicator of the influence of a document. Finally, we extended the influence analysis to authors by aggregating document lead/lag indices. These lead/lag scores are the first measures able to identify key authors and documents in collections that lack citation information.

We believe that temporal analysis of document collections is an exciting area that deserves future research. The methods presented in this paper give evidence that such analyses are possible even without citation information. However, more principled approaches are likely to be even more accurate and could provide more meaningful insights. For example, currently there is no way to associate influential documents with clusters in the Temporal Cluster Histograms. It would be interesting to identify the set of papers that are responsible for spawning a new topic cluster. Similarly, it would be interesting to design specialized clustering algorithms that directly capture the splitting and merging of topics over time to get an overview of the “flow” of ideas in the collection. We are planning to explore these questions in future work.

References

[1] J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. Topic Detection and Tracking Pilot Study: Final Report. In Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop-1998, 1998.

[2] J. Allan, R. Papka, and V. Lavrenko. On-Line New Event Detection and Tracking. In Research and Development in Information Retrieval, pages 37–45, 1998.

[3] D. Cohn and H. Chang. Learning to Probabilistically Identify Authoritative Documents. In Proceedings of the 17th ICML, pages 167–174, Morgan Kaufmann, San Francisco, CA, 2000.

[4] E. Garfield. The Impact Factor. http://www.isinet.com/essays/journalcitationreports/7.html/.


Author                      Rank    Papers  Citations    Author                     Rank    Papers  Citations
jordan, michael i.           0.037  27       9284        bengio, yoshua            -0.131   18      1805
smola, alex                 -0.004  13       3038        saad, david               -0.133   11       694
scholkopf, b.               -0.022  10       5338        bialek, william           -0.135   11      1547
atkeson, christopher g.     -0.060  10       3378        dayan, peter              -0.138   24      4014
williams, christopher k.i.  -0.067  16       1605        ghahramani, zoubin        -0.142   14      3171
sejnowski, terrence j.      -0.069  46      13955        shawe-taylor, john        -0.158    9      4014
hinton, geoffrey e.         -0.075  27      11643        tresp, volker             -0.162   16       672
jaakkola, tommi             -0.091  10       2918        sollich, peter            -0.173    9       739
miller, kenneth d.          -0.106  11       2447        barto, a.g.               -0.175   12      7100
coon, d. d.                 -0.112  21        531        leong, harrison monfook   -0.196   15         0

Figure 6. The above list contains the NIPS authors with the highest ranking in the author lead/lag index. From considering each paper's k = 14 nearest neighbors for the document lead/lag index and then aggregating the document lead/lag index by author, our algorithm produces a ranking for how groundbreaking an author's work is. Since citation analysis has shown that the number of citations an author receives is typically the best estimate for an author's importance, we provide Google Scholar citation counts for these authors. Additionally, we provide counts for how many NIPS papers these authors have published.

[5] E. Garfield. The Meaning of the Impact Factor. International Journal of Clinical and Health Psychology, 3(2):363–369, 2003.

[6] R. Guha, D. Sivakumar, R. Kumar, and R. Sundaram. Unweaving a Web of Documents. In Proceedings of KDD-2005, Chicago, Illinois, 2005.

[7] S. Havre, B. Hetzler, and L. Nowell. ThemeRiver: In Search of Trends, Patterns, and Relationships. IEEE Transactions on Visualization and Computer Graphics, 2002.

[8] http://nips.djvuzone.org/txt.html. NIPS Online: The Text Repository.

[9] J. Kleinberg. Authoritative Sources in a Hyperlinked Environment. Journal of the ACM, 46(5):604–632, 1999.

[10] J. Kleinberg. Bursty and Hierarchical Structure in Streams. In Proceedings of KDD-2002, Edmonton, Alberta, Canada, 2002.

[11] T. Kolenda, L. K. Hansen, and J. Larsen. Signal Detection using ICA: Application to Chat Room Topic Spotting. In Lee, Jung, Makeig, and Sejnowski, editors, Proc. of the Third International Conference on Independent Component Analysis and Signal Separation (ICA2001), pages 540–545, San Diego, CA, USA, 2001.

[12] A. McGovern, L. Friedland, M. Hay, B. Gallagher, A. Fast, J. Neville, and D. Jensen. Exploiting Relational Structure to Understand Publication Patterns in High-Energy Physics. In Proceedings of KDD-2003, Washington, DC, 2003.

[13] Q. Mei and C. Zhai. Discovering Evolutionary Theme Patterns from Text - An Exploration of Temporal Text Mining. In Proceedings of KDD-2005, Chicago, Illinois, 2005.

[14] F. Osareh. Bibliometrics, Citation Analysis and Co-citation Analysis: A Review of Literature I. Libri, 46:149–158, 1996.

[15] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford University, 1999.

[16] A. Popescul, G. W. Flake, S. Lawrence, L. H. Ungar, and C. L. Giles. Clustering and Identifying Temporal Trends in Document Databases. In IEEE Advances in Digital Libraries ADL-2000, pages 173–182, Washington, DC, 2000.

[17] P. Reutemann, B. Pfahringer, and E. Frank. Proper: A Toolbox for Learning from Relational Data with Propositional and Multi-Instance Learners. In Proceedings of the 17th Australian Joint Conference on Artificial Intelligence. Springer-Verlag, 2004.

[18] G. Salton. Developments in Automatic Text Retrieval. Science, 253:974–979, 1991.

[19] G. Salton and C. Buckley. Weighting Approaches in Automatic Text Retrieval. Information Processing and Management, 24(5):513–523, 1988.

[20] R. Swan and D. Jensen. TimeMines: Constructing Timelines with Statistical Models of Word Usage. In Proceedings of KDD-2000, pages 73–80, Boston, MA, 2000.

[21] F. B. Viegas, M. Wattenberg, and K. Dave. Studying Cooperation and Conflict between Authors with history flow Visualizations. In Proceedings of CHI-2004, Vienna, Austria, 2004.

[22] H. D. White. Citation Analysis and Discourse Analysis Revisited. Applied Linguistics, 25(1):89–116, 2004.

