Hierarchical TimeSeries Clustering for Data Streams
Pedro Rodrigues, João Gama and João Pedro Pedroso
Computer Science Department, Faculty of Science
Artificial Intelligence and Computer Science Laboratory
University of Porto
Overview
● Motivation
● Related Work
● Divisive Analysis Clustering (DIANA)
● Online Divisive-Agglomerative Clustering (ODAC)
  – Behaviour
  – Structure
  – Algorithm Criteria
    ● Splitting
    ● Split Support
    ● Aggregate
● Experimental Work
● Discussion and Future Work
● Summary
Motivation
● Many modern databases consist of continuously stored data from unclassified time series
– Power systems, financial markets, web logs, network routers, etc.
● The environment is often so dynamic that our models must always be adapting
● Our goal is to design a system that:
– hierarchically clusters time series (“whole” clustering);
● defines clusters of variables
● no need to predefine the number of clusters
– is built incrementally, working online;
– dynamically adapts to new data;
– basically... works! ;)
Related Work
● Parametric Clustering
– Reconstructive models (tend to minimize a cost function)
● K-means, K-medians, Simulated Annealing, ...
– Generative models (assume instances are observations from a set of K unknown distributions)
● Gaussian Mixture Models using Expectation-Maximization, Fuzzy C-Means, ...
● Nonparametric Clustering (hierarchical models)
– usually based on dissimilarities between elements of the same cluster
– either agglomerative (AGNES) or divisive (DIANA)
● Data Streams
– VFDT, VFDTc, UFFT, VFML...
Divisive Analysis Clustering (DIANA)
● Starts with one large cluster containing all time series
● At each step the largest cluster is divided into two
● Stops when all clusters contain only one time series
● The heights of the splits are kept to construct a dendrogram
[Figure: EUNITE dataset, 20 variables, 15973 examples, 1-corr dissimilarity]
Online Divisive-Agglomerative Clustering (ODAC)
● ODAC main characteristics:
  – Expand Structure
    ● divide clusters
  – Contract Structure
    ● aggregate clusters
  – Other Issues
    ● top-down strategy
    ● incremental
    ● works online
    ● any-time cluster definition
    ● single scan over the data
[Figure: EUNITE dataset, 20 variables, 15973 examples, 1-corr dissimilarity]
ODAC Behaviour
Dissimilarity Measure
● DIANA uses the real dissimilarity between time series
● We could benefit from a ranged measure...
$$d(a,b) = \sum_{i=1}^{n} \frac{|a_i - b_i|}{n}$$

$$corr(a,b) = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{a}\,\bar{b}}{\sqrt{\left(\sum_{i=1}^{n} a_i^2 - n\bar{a}^2\right)\left(\sum_{i=1}^{n} b_i^2 - n\bar{b}^2\right)}}$$
Incremental Correlation
● We can see that the sufficient statistics needed to compute the correlation on the fly are:
$$A = \sum_{i=1}^{n} a_i \quad B = \sum_{i=1}^{n} b_i \quad A2 = \sum_{i=1}^{n} a_i^2 \quad B2 = \sum_{i=1}^{n} b_i^2 \quad AB = \sum_{i=1}^{n} a_i b_i \quad N = n$$

so that, at any moment,

$$corr_N(a,b) = \frac{AB - \frac{A \cdot B}{N}}{\sqrt{\left(A2 - \frac{A^2}{N}\right)\left(B2 - \frac{B^2}{N}\right)}} = \frac{\sum_{i=1}^{n} a_i b_i - n\,\bar{a}\,\bar{b}}{\sqrt{\left(\sum_{i=1}^{n} a_i^2 - n\bar{a}^2\right)\left(\sum_{i=1}^{n} b_i^2 - n\bar{b}^2\right)}}$$
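The incremental computation above can be sketched as a small Python class (a sketch of mine, not the authors' code): it maintains A, B, A2, B2, AB and N for a pair of streams and recomputes the correlation, and the ranged dissimilarity 1 - corr, in constant time per query.

```python
import math

class CorrStats:
    """Sufficient statistics for the correlation between two streams a and b.

    Maintains A = sum(a_i), B = sum(b_i), A2 = sum(a_i^2), B2 = sum(b_i^2),
    AB = sum(a_i * b_i) and N = n, so the correlation can be recomputed
    at any time in O(1), without revisiting old examples.
    """

    def __init__(self):
        self.A = self.B = self.A2 = self.B2 = self.AB = 0.0
        self.N = 0

    def update(self, a_i, b_i):
        """Fold one new pair of observations into the statistics."""
        self.A += a_i
        self.B += b_i
        self.A2 += a_i * a_i
        self.B2 += b_i * b_i
        self.AB += a_i * b_i
        self.N += 1

    def corr(self):
        """corr_N(a, b) computed from the sufficient statistics alone."""
        num = self.AB - self.A * self.B / self.N
        den = math.sqrt((self.A2 - self.A ** 2 / self.N) *
                        (self.B2 - self.B ** 2 / self.N))
        return num / den

    def dissim(self):
        """d_n(a, b) = 1 - corr_N(a, b), ranged in [0, 2]."""
        return 1.0 - self.corr()
```

A perfectly correlated pair yields dissimilarity 0; a perfectly anti-correlated pair yields 2, the two ends of the [0, 2] range.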
Dissimilarity Measure and Diameter
● We use as dissimilarity measure between time series a and b, at n examples:
$$d_n(a,b) = 1 - corr_N(a,b), \qquad d_n : \mathbb{N} \times \mathbb{N} \to [0,2] \subset \mathbb{R}$$
● As in DIANA, we consider the highest dissimilarity between two time series belonging to the same cluster as the cluster's diameter.
For three variables a, b and c, the pairwise sufficient statistics form the symmetric matrix

$$\begin{pmatrix} \sum_{i=1}^{n} a_i^2 & \sum_{i=1}^{n} a_i b_i & \sum_{i=1}^{n} a_i c_i \\ \sum_{i=1}^{n} a_i b_i & \sum_{i=1}^{n} b_i^2 & \sum_{i=1}^{n} b_i c_i \\ \sum_{i=1}^{n} a_i c_i & \sum_{i=1}^{n} b_i c_i & \sum_{i=1}^{n} c_i^2 \end{pmatrix}$$
ODAC Structure
[Figure: the ODAC cluster tree, where root C1 splits into C2 and C3, and C2 splits into C4 and C5; a leaf holding the variables {a, b, c} is highlighted.]

A leaf with variables {a, b, c} keeps the dissimilarity matrix

$$\begin{pmatrix} 0 & d(a,b) & d(a,c) \\ d(a,b) & 0 & d(b,c) \\ d(a,c) & d(b,c) & 0 \end{pmatrix}$$

and each node stores sufficient statistics only for its own variables: the count $n$, the sums $\sum_{i=1}^{n} v_i$ and $\sum_{i=1}^{n} v_i^2$ for each variable $v$, and the products $\sum_{i=1}^{n} u_i v_i$ for each pair of variables $u, v$ in the node.

A cluster with $m$ variables thus maintains its $m(m-1)/2$ pairwise dissimilarities from $m(m+1)/2$ data cells of sufficient statistics.
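From the stored statistics, a leaf's diameter can be recomputed at any time. A minimal sketch (function and argument names are mine, not from the slides) that derives each pairwise dissimilarity 1 - corr from the sums, squares and cross products:

```python
import itertools
import math

def pairwise_dissim(n, s, s2, p, x, y):
    """d_n(x, y) = 1 - corr_N(x, y) from sufficient statistics.

    n        -- number of examples seen
    s[v]     -- sum of the values of variable v
    s2[v]    -- sum of the squared values of variable v
    p[x, y]  -- sum of the products x_i * y_i
    """
    num = p[x, y] - s[x] * s[y] / n
    den = math.sqrt((s2[x] - s[x] ** 2 / n) * (s2[y] - s[y] ** 2 / n))
    return 1.0 - num / den

def diameter(n, s, s2, p, variables):
    """Cluster diameter: the largest pairwise dissimilarity in the leaf."""
    return max(pairwise_dissim(n, s, s2, p, x, y)
               for x, y in itertools.combinations(variables, 2))
```

The pair keys in `p` follow the order of `variables`, matching the upper triangle of the statistics matrix above.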
Splitting Criteria
● DIANA always splits clusters down to single objects
● ODAC splits only when we have confidence in a good decision: the Hoeffding bound.
● For n independent observations of a variable $v_k$ with mean $\bar{v}_k$ and range R, the Hoeffding bound states that, with probability $1-\delta$, the true mean of the variable is at least $\bar{v}_k - \epsilon_n$, where
$$\epsilon_n = \sqrt{\frac{R^2 \ln(1/\delta)}{2n}}$$
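The bound itself is a one-liner. A small helper (the naming is mine), used here with R = 2 since the dissimilarity ranges over [0, 2]:

```python
import math

def hoeffding_bound(r, delta, n):
    """epsilon_n = sqrt(R^2 * ln(1/delta) / (2n)).

    With probability 1 - delta, the true mean of a variable with range R,
    observed n independent times, is within epsilon_n of the sample mean.
    """
    return math.sqrt(r * r * math.log(1.0 / delta) / (2.0 * n))
```

Note the bound shrinks as more examples arrive, so decisions that cannot be made confidently now may become possible later, without revisiting the data.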
Splitting Criteria
● Let $C_k$ be the largest current cluster in the system. We use an improved version of the splitting rule from DIANA:
● Rank the variables $v_i \in C_k$ by their average dissimilarity $\bar{d}_k(i)$, in descending order, e.g.
$$\bar{d}_k(a) \geq \bar{d}_k(c) \geq \bar{d}_k(b)$$
● Following the Hoeffding bound, we choose to split this cluster if
$$\bar{d}_k(a) - \bar{d}_k(c) > \epsilon_n$$
ensuring, with confidence $1-\delta$, that this difference is significant.
● But this is not all... what if the two most dissimilar variables have the same average dissimilarity? The cluster would never be split!
Splitting Criteria Enhanced

[Figure: dissimilarity-space view comparing the cut points chosen by DIANA, by the plain Hoeffding-bound rule, and by ODAC.]

● If $\bar{d}_k(a) - \bar{d}_k(c) < \epsilon_n$ then we test $\bar{d}_k(c) - \bar{d}_k(b) > \epsilon_n$
● If this is true, then we move both variables a and c to the new cluster and repeat the test for the other variables.
● If not, we just follow the ranking until a cut point is found, or no split will occur.
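Putting the Hoeffding-bound split test and the tie-handling rule above together, a minimal sketch (the function and parameter names are my own, not the authors'):

```python
import math

def hoeffding_bound(r, delta, n):
    """epsilon_n = sqrt(R^2 * ln(1/delta) / (2n))."""
    return math.sqrt(r * r * math.log(1.0 / delta) / (2.0 * n))

def should_split(avg_dissim, n, delta=0.05, r=2.0):
    """Decide a split from each variable's average dissimilarity to
    the rest of its cluster.

    Returns the variables that would seed the new cluster, or None.
    Walking the ranking from the most dissimilar variable down, a cut
    happens at the first gap larger than epsilon_n; variables tied
    within epsilon_n move along together.
    """
    eps = hoeffding_bound(r, delta, n)
    ranked = sorted(avg_dissim, key=avg_dissim.get, reverse=True)
    splinter = [ranked[0]]
    for prev, cur in zip(ranked, ranked[1:]):
        if avg_dissim[prev] - avg_dissim[cur] > eps:
            return splinter          # significant gap: cut here
        splinter.append(cur)         # tied within epsilon: move along
    return None                      # no significant cut point found
```

With a clear gap only one variable splits off; when the top two are tied within epsilon, both move, exactly as the enhanced rule prescribes.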
Splitting Criteria Enhanced
● After a split point has been detected, DIANA moves to the new cluster those variables that are, on average, closer to the splinter group than to the remaining group, i.e. those with $\bar{d}_k(b) - \bar{d}_s(b) > 0$.
● ODAC has a different perspective: are we confident that the other variables should move to the new cluster along with those already moved?
● We move variable b to the new splinter cluster $C_s$ only if
$$\bar{d}_k(b) - \bar{d}_s(b) > \epsilon_n$$
Split Support Criteria
● After a split, we only keep the new divided structure if the change really improves a quality measure, the Divisive Coefficient.
● Let dd(i) be the diameter of the last cluster $C_k$ to which variable $v_i$ belonged, divided by the global diameter. The divisive coefficient of a cluster definition clust is
$$DC(clust) = 2\left(1 - \frac{1}{m}\sum_{i=1}^{m} dd(i)\right)$$
● A new cluster definition clust2 is kept only if
$$DC(clust2) - DC(clust) > \epsilon_n$$
ensuring, with confidence $1-\delta$, that the new structure is better (according to the quality measure, of course) than the previous one.
Aggregate Criteria
● If at some point no split occurs in any of the existing clusters, we proceed to agglomerative clustering.
● Let $C_k$ be the smallest current cluster of the definition clust.
● Assume $C_k$'s parent is a leaf and compute the corresponding $DC(clust2)$
● Following the Hoeffding bound, we choose this pruned definition if
$$DC(clust) - DC(clust2) < \epsilon_n$$
meaning there is no confidence that this difference is significant.
● All of $C_k$'s parent's children are pruned!
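Both the split-support and the aggregate decisions reduce to comparing divisive coefficients against the Hoeffding bound. A sketch, assuming DC averages the normalized diameters dd(i) over the m variables (my reading of the formula; the helper names are mine):

```python
def divisive_coefficient(dd):
    """DC(clust) = 2 * (1 - mean(dd)), where dd[i] is the diameter of
    the last cluster that variable v_i belonged to, divided by the
    global diameter (so every dd[i] lies in [0, 1])."""
    return 2.0 * (1.0 - sum(dd) / len(dd))

def split_improves(dd_before, dd_after, eps):
    """Split support: keep the divided structure only when the gain is
    significant, DC(clust2) - DC(clust) > epsilon_n."""
    return divisive_coefficient(dd_after) - divisive_coefficient(dd_before) > eps

def should_aggregate(dd_expanded, dd_pruned, eps):
    """Aggregate: prune a leaf's children back into their parent when
    there is no confidence that the expanded structure is better,
    DC(clust) - DC(clust2) < epsilon_n."""
    return divisive_coefficient(dd_expanded) - divisive_coefficient(dd_pruned) < eps
```

Smaller normalized diameters give a DC closer to 2, so a split that shrinks every variable's last-cluster diameter raises DC, and an expansion that barely changes it gets undone.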
Experimental Work
[Figure: cluster structures produced by ODAC and by DIANA on the dataset (ODAC: 1.56, DIANA: 1.38).]
Discussion and Future Work
● The system is built incrementally and behaves dynamically; it works online, giving an any-time cluster definition;
● The system stores all information about the variables since it started; concept drift will clarify the notion of learning and forgetting;
● Sometimes the system stalls at a given cluster, dividing and aggregating consecutively;
● The benefits from saving computations are still to be confirmed, as the divisive coefficient is calculated every time a split support or aggregate decision must be made;
● Introducing new time series along the iterations is not yet implemented, but it is in the works...
Summary
● ODAC, a new algorithm to incrementally cluster online time series in data streams:
– hierarchically clusters time series (“whole” clustering);
– top-down strategy;
– is built incrementally, working online;
– dynamically adapts to new data: dividing and aggregating;
– any-time cluster definition;
– single scan over the data;
– basically... works ;) ...but should and will (!) work better!
● What is missing in ODAC?
You tell us, please!
Thank You!
Pedro Rodrigues, João Gama and João Pedro Pedroso
Computer Science Department, Faculty of Science
Artificial Intelligence and Computer Science Laboratory
University of Porto
DIANA Algorithm
1. Select the cluster $C_k$ with the highest diameter.
2. Find the object with the highest average dissimilarity to all other objects and start a new cluster $C_s$ (the splinter group) with this object.
3. For each object $i \in C_k \setminus C_s$ compute
$$D_i = \operatorname{average}_{j \notin C_s} d(i,j) - \operatorname{average}_{j \in C_s} d(i,j)$$
4. Find the object h with the largest difference $D_h$. If $D_h > 0$ then h is, on average, closer to the splinter group than to the old group, so move h to $C_s$.
5. Repeat steps 3 and 4 until all $D_i$ are negative.
6. If there is a cluster $C_k$ with $\#C_k > 1$ then go to 1.
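Steps 2-5 above can be written compactly. This is a sketch of mine (not the original implementation), with d given as a plain symmetric dissimilarity function:

```python
def diana_split(d, cluster):
    """One DIANA split (steps 2-5 above): divide `cluster` into the
    remaining group and the splinter group.

    d       -- symmetric dissimilarity function d(i, j)
    cluster -- list of object identifiers (at least two objects)
    Returns (remaining_group, splinter_group).
    """
    def avg_to(i, group):
        return sum(d(i, j) for j in group) / len(group)

    # step 2: seed the splinter group with the object of highest
    # average dissimilarity to all the others
    seed = max(cluster, key=lambda i: avg_to(i, [j for j in cluster if j != i]))
    splinter = [seed]
    rest = [i for i in cluster if i != seed]

    # steps 3-5: while some object is, on average, closer to the
    # splinter group than to the old group, move it across
    while len(rest) > 1:
        diffs = {i: avg_to(i, [j for j in rest if j != i]) - avg_to(i, splinter)
                 for i in rest}
        h = max(diffs, key=diffs.get)
        if diffs[h] <= 0:
            break
        splinter.append(h)
        rest.remove(h)
    return rest, splinter
```

Applying step 1 repeatedly (always to the largest remaining cluster) and recording the diameters at each split yields the full dendrogram.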
ODAC Algorithm
1. Get the next $n_{min}$ examples.
2. Update and propagate the sufficient statistics for all variables.
3. Compute the Hoeffding bound $\epsilon$.
4. Choose the next cluster $C_k$ in descending order of diameters.
5. TestSplit() in cluster $C_k$.
6. If we found a split point, go to 12 with the new cluster tree.
7. If there still exists a cluster $C_k$ not yet tested for splitting, go to 4.
8. Choose the next cluster $C_k$ in ascending order of diameters.
9. TestAggregate() in cluster $C_k$.
10. If we found an aggregation, go to 12 with the new cluster tree.
11. If there still exists a cluster $C_k$ not yet tested for aggregation, go to 8.
12. If not at the end of the data, go to 1.
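The main loop can be sketched as follows; the Leaf interface and the test callbacks are hypothetical stand-ins for the structures and the TestSplit/TestAggregate procedures named above:

```python
def odac_pass(leaves, batch, eps, test_split, test_aggregate):
    """One iteration of the main loop above (steps 1-12), as a sketch.

    leaves         -- leaf-cluster objects exposing .update(batch)
                      and .diameter() (a hypothetical interface)
    batch          -- the next n_min examples
    test_split     -- stand-in for TestSplit(leaf, eps) -> bool
    test_aggregate -- stand-in for TestAggregate(leaf, eps) -> bool
    Returns which structural change, if any, was applied.
    """
    for leaf in leaves:                    # step 2: update statistics
        leaf.update(batch)
    # steps 4-7: try to split, largest diameter first
    for leaf in sorted(leaves, key=lambda c: c.diameter(), reverse=True):
        if test_split(leaf, eps):
            return "split", leaf
    # steps 8-11: no split found; try to aggregate, smallest diameter first
    for leaf in sorted(leaves, key=lambda c: c.diameter()):
        if test_aggregate(leaf, eps):
            return "aggregate", leaf
    return "none", None
```

Trying splits on the largest diameters first and aggregations on the smallest first mirrors the ordering in steps 4 and 8.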