Hierarchical TimeSeries Clustering for Data Streams

Hierarchical TimeSeries Clustering forData Streams

Pedro Rodrigues, João Gama and João Pedro Pedroso

Computer Science Department Faculty of ScienceArtificial Intelligence and Computer Science Laboratory

University of Porto

Overview

● Motivation● Related Work● Divisive Analysis Clustering (DIANA)● Online DivisiveAgglomerative Clustering (ODAC)

– Behaviour– Structure– Algorithm Criteria

● Splitting● Split Support● Aggregate

● Experimental Work● Discussion and Future Work● Summary

Pedro Rodrigues 2

Motivation

● Many modern databases consist of continuously stored data from unclassified timeseries– Power systems, financial market, web logs, network routers, etc...

● Environment is often so dynamic that our models must be always adapting

● Our goal is to design a system that:

– hierarchically cluster timeseries (“whole” clustering);

● defines clusters of variables● one has no need to predefine the number of clusters

– is built incrementally, working online;

– dynamically adapts to new data;

– basically... works! ;)

Pedro Rodrigues 3

Related Work

● Parametric Clustering

– Reconstructive models (tend to minimize a cost function)

● Kmeans, Kmedians, Simulated Annealing, ...

– Generative models (assume instances are observations from a set of K unknown distributions

● Gaussian Mixture Model using ExpectationMaximization, CMeans Fuzzy, ...

● Nonparametric Clustering (hierarchical models)

– usually based on dissimilarities between elements of the same cluster

– either agglomerative (AGNES) or divisive (DIANA)

● Data Streams

– VFDT, VFDTc, UFFT, VFML...

Pedro Rodrigues 4

Divisive Analysis Clustering (DIANA)

● Starts with one large cluster containing all timeseries

● At each step the largest cluster is divided in two

● Stop when all clusters contain only one timeseries

● Keep heights of splitting to construct a dendrogram

Pedro Rodrigues 5

EUNITE dataset20 variables15973 examples1corr dissimilarity

Online DivisiveAgglomerative Clustering (ODAC)

● ODAC main characteristics:– Expand Structure

● divide clusters

– Contract Structure

● aggregate clusters

– Other Issues

● topdown strategy● incremental● works online● any time cluster definition● single scan on data

Pedro Rodrigues 6

EUNITE dataset20 variables15973 examples1corr dissimilarity

ODAC Behaviour

Pedro Rodrigues 7

ODAC Behaviour

Pedro Rodrigues 8

ODAC Behaviour

Pedro Rodrigues 9

Dissimilarity Measure

● DIANA uses real dissimilarity between timeseries

● We could benefit from a ranged measure...

Pedro Rodrigues 10

d a , b=∑i=1

n ∣a i−bi∣n

corr a , b=∑i=1

n

a i b i−n ab

∑i=1

n

a i2−n a

2∑i=1

n

b i2−n b2

Incremental Correlation

● We can see that the sufficient statistics needed to compute correlation on the fly are...

Pedro Rodrigues 11

A=∑i=1

n

ai

B=∑i=1

n

bi

A2=∑i=1

n

ai2

B2=∑i=1

n

bi2

AB=∑i=1

n

ai bi

N=n

corr N a , b=AB−

A.BN

A2−A2

N B2−B2

N

corr a , b=∑i=1

n

ai bi−na b

∑i=1

n

a i2−na

2∑i=1

n

bi2−nb2

Dissimilarity Measure Diameter

● We use as dissimilarity measure between timeseries a and b, at n examples:

● As in DIANA, we consider the highest dissimilarity between two timeseries belonging to the same cluster as the cluster's diameter.

Pedro Rodrigues 12

d na , b=1−corr N a , bd n :ℕ×ℕ[0,2 ]ℝ

∑i=1

n

ai2 ∑

i=1

n

ai bi ∑i=1

n

ai c i

∑i=1

n

ai bi ∑i=1

n

bi2 ∑

i=1

n

bi c i

∑i=1

n

ai c i ∑i=1

n

bi c i ∑i=1

n

c i2

ODAC Structure

Pedro Rodrigues

C1

C2

C4 C5

C3

abc

a b c0 d a , b d a ,c

d a , b 0 d b , cd a , c d b , c 0

n ,∑i=1

n

ai ,∑i=1

n

bi ,∑i=1

n

c i

n ,∑i=1

n

ai ,∑i=1

n

bi

n ,∑i=1

n

c i ,∑i=1

n

c i2

n ,∑i=1

n

bi ,∑i=1

n

bi2

n ,∑i=1

n

ai ,∑i=1

n

a i2

∑i=1

n

ai2 ∑

i=1

n

a i bi

∑i=1

n

ai bi ∑i=1

n

bi2

13

mm−12

data cellsm m1

2data cells

m m12

data cells

Splitting Criteria

● DIANA always splits clusters into single objects

● ODAC splits only when we have confidence on a good decision: Hoeffding bound.

● For n independent observations of variable with mean and range R, the Hoeffding bound states that with probability the true mean of the variable is at least , where

Pedro Rodrigues 14

n= R2 ln 1/2 n

vk−n

1−vk

vk

Splitting Criteria

● Let be the largest current cluster on the system. We use an improved version of the splitting rule from DIANA:

● Rank variables by average dissimilarity ( ), in descent order, ex:

● Following the Hoeffding bound, we choose to split this cluster if

ensuring, with confidence , that this difference is significant.

● But this is not all... what if the two most dissimilar have the same average dissimilarity? The cluster would never be split!

Pedro Rodrigues 15

d nikv i∈Ck

Ck

d nak d nck d nbk

d nak− d nckn

1−

Splitting Criteria Enhanced Dissimilarity Space View

Pedro Rodrigues 16

oooooooo

o

DIANA HBRule ODAC

oooooooo

oooooooo

o

o

● If then we test

● If this is true, then we move both variables a and c to the new cluster and then test for the other variables.

● If not, just follow the ranking until a cut point is found, or no split will occur.

d nak− d ncknd k c− d k bn

Splitting Criteria Enhanced

Pedro Rodrigues 17

● After a split point has been detected, DIANA changes to the new cluster those variables that are closer, in average, to the splinter group then to the remaining group.

● ODAC has a different perspective:Are we confident that the other variables should move to the new cluster along those already moved?

● Move variable b to new cluster if

d nbk− d nbsn

C s

d bk−d bs0

Split Support Criteria

● After a split, we only keep the new divided structure if the change really improves a quality measure, the Divisive Coefficient.

● Let dd(i) be the diameter of the last cluster Ck to which variable v

i

belonged, divided by the global diameter. The divisive coefficient of a cluster definition clust is

● A new cluster definition clust2 is kept only if

ensuring, with confidence , that the new structure is better (according to the quality measure, of course) than the previous.

Pedro Rodrigues 18

DCclust=2 1−∑i=1

m

dd i/n

DCclust2−DCclustn

1−

● If at some point, no split occur in any of the existent clusters, we proceed to agglomerative clustering.

● Let be the smallest current cluster of the definitionclust

.

● Assume 's parent is a leaf and compute corresponding

● Following the Hoeffding bound, we choose this cluster definition if

assuming no confidence that this difference is significant.

● All of 's parent's children are pruned!

C k

Aggregate Criteria

Pedro Rodrigues 19

Ck

DCclust−DCclust2n

DCclust2

Ck

Experimental Work

Pedro Rodrigues 20

{{

{

{

{

{{

ODACDIANA

1.56 1.38

Discussion and Future Work

● The system is built incrementally and behaves dynamically;Works online, giving an any time cluster definition;

● The system stores all information about variables since it started;Concept drift will clarify the notion of learn and forget;

● Some times the system stalls at a given cluster, dividing and aggregating consecutively;

● Benefits from saving computations are still to assure as the divisive coefficient is calculated every time a split support or aggregate decision must be made;

● Introducing new timeseries along the iterations is not yet implemented but is on the forge...

Pedro Rodrigues 21

Summary

● ODAC a new algorithm to incrementally cluster online timeseries in data streams:– hierarchically cluster timeseries (“whole” clustering);

– topdown strategy;

– is built incrementally, working online;

– dynamically adapts to new data: dividing and aggregating;

– any time cluster definition;

– single scan on data;

– basically... works ;)...but should and will (!) work better!

– What is missing in ODAC?

Pedro Rodrigues 22

You tell us, please!

Thank You!

Pedro Rodrigues, João Gama and João Pedro Pedroso

Computer Science Department Faculty of ScienceArtificial Intelligence and Computer Science Laboratory

University of Porto

DIANA Algorithm

Pedro Rodrigues 24

1. Select cluster Ck with highest diameter

2. Find the object with highest average dissimilarity to all other objects and start a new cluster C

s (splinter group)

with this object.

3. For each object compute

4. Find the object h with largest difference Dh. If D

h > 0 then h is,

on average, closer to the splinter group than to the old group, so move h to C

s.

5. Repeat steps 3. and 4. until all Di are negative

6. If there is a cluster Ck with #C

k > 1 then goto 1.

s∈Ck

i∈C k

i∈Ck∖C s

Di={averagej

d i , j , j∉C s}−{averagej

d i , j , j∈C s}

ODAC Algorithm

Pedro Rodrigues 25

1. Get next examples

2. Update and propagate sufficient statistics for all variables

3. Compute the Hoeffding bound

4. Choose next cluster Ck in descending order of diameters

5. TestSplit() in cluster Ck

6. If we found a split point, goto 12. with new cluster tree

7. If still exists a cluster Ck not yet tested for splitting goto 4.

8. Choose next cluster Ck in ascending order of diameters

9. TestAggregate() in cluster Ck

10. If we found an aggregation then goto 12. with new cluster tree

11. If still exists a cluster Ck not yet tested for aggregation goto 8.

12. If not end of data, goto 1.

nmin

∈

Hierarchical TimeSeries Clustering for Data Streams

Documents

Transcript of Hierarchical TimeSeries Clustering for Data Streams