Data Streaming for Autonomic Computing in the EGEE framework (slides: sebag/Slides/seminar_XZ_june.pdf, 2010-06-14)
Motivation | Hierarchical AP (Hi-AP): Clustering Large-scale Data | StrAP: Clustering Streaming Data | Conclusion and Perspectives
Data Streaming for Autonomic Computing in the
EGEE framework
Xiangliang Zhang, Cyril Furtlehner, Michele Sebag
TAO − INRIA, CNRS, Université de Paris-Sud, F-91405 Orsay Cedex, France
Xiangliang Zhang, Cyril Furtlehner, Michele Sebag Data Streaming for Autonomic Computing in the EGEE framew
Contents
1 Motivation
   Motivation: Autonomic Computing
   Introduction of Affinity Propagation (AP)
2 Hierarchical AP (Hi-AP): Clustering Large-scale Data
   Hi-AP Algorithm
   Hi-AP Application on EGEE Grid logs
3 StrAP: Clustering Streaming Data
   Challenges and Related Work
   StrAP Algorithm
   StrAP Application on Intrusion Detection (KDD99 data)
   A StrAP-based Real-time Online Grid Monitoring System
4 Conclusion and Perspectives
Motivations of Autonomic Computing
Goals of Autonomic Computing
AUTONOMIC VISION & MANIFESTO: http://www.research.ibm.com/autonomic/manifesto/
A self-managing system with the ability of
Self-healing: detect, diagnose and repair problems
Self-configuring: automatically incorporate and configure components
Self-optimizing: ensure optimal functioning with respect to defined requirements
Self-protecting: anticipate and defend against security breaches
Data Mining for Autonomic Computing
Autonomic Grid Computing System
EGEE: Enabling Grids for E-sciencE, http://www.eu-egee.org
EGEE User Forum: annual event since 2007
Job stream monitoring by clustering
Goal: summarize large-scale, fast-arriving data.
provide a compact description
help to find interesting patterns
classify the incoming data
Challenges:
Large size:
save all the data and process them as a whole? This requires huge disk, CPU, and memory (impossible for data of size GB, TB, or even PB).
process the data part by part? Then how can global optimization be guaranteed?
Changing distribution: for time-ordered data, how can the clusters keep tracking the evolving data?
What is Clustering ?
unsupervised learning method
group similar points together in the same group (cluster)
widely used on various problems: interesting groups discovery, data structure presentation, data classification, data compression, dimensionality reduction or feature selection
many clustering methods are available, e.g., hierarchical clustering methods, density-based methods (DBSCAN), partitioning methods (k-means)
Our requirements of clustering method
No need to set the number K of clusters (a double-edged sword)
global optimization of the clustering result: not locally optimized by a greedy approach
stable clustering result: not affected by the initialization
real data points as representative exemplars (cluster centers): suits application fields where averaged centers are meaningless, e.g. molecules, or jobs described by categorical attributes
Affinity Propagation (AP) (Frey & Dueck, Science 2007)
Iterations of Message passing in AP
Introduction of AP
Input:
Data: $x_1, x_2, \ldots, x_N$; distance: $d(x_i, x_j)$
Find:
$\sigma: x_i \mapsto \sigma(x_i)$, the exemplar representing $x_i$, such that
$$\max_{\sigma} \sum_{i=1}^{N} S(x_i, \sigma(x_i))$$
where
$$S(x_i, x_j) = -d^2(x_i, x_j) \ \text{ if } i \neq j, \qquad S(x_i, x_i) = -s^*$$
$s^*$: user-defined parameter (penalty)
$s^* = \infty$: only one exemplar (one cluster)
$s^* = 0$: every point is an exemplar (N clusters)
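The objective above can be made concrete with a small sketch. It builds the similarity matrix S from squared Euclidean distances, places the penalty on the diagonal, and scores a candidate assignment σ. The function names `similarity_matrix` and `objective`, the choice of Euclidean distance, and the toy data are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def similarity_matrix(X, s_star):
    """S[i, j] = -d^2(x_i, x_j) for i != j, and S[i, i] = -s_star.

    Assumes d is the Euclidean distance (an illustrative choice).
    """
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    S = -np.sum(diff ** 2, axis=-1)
    np.fill_diagonal(S, -s_star)
    return S

def objective(S, sigma):
    """Sum over i of S(x_i, sigma(x_i)); sigma[i] is the exemplar index of x_i."""
    sigma = np.asarray(sigma)
    return float(S[np.arange(len(sigma)), sigma].sum())

# Toy data: three points on a line, penalty s* = 2 (arbitrary value).
X = [[0.0], [1.0], [3.0]]
S = similarity_matrix(X, s_star=2.0)
one_cluster = objective(S, [1, 1, 1])       # x_2 is the single exemplar
all_exemplars = objective(S, [0, 1, 2])     # every point is its own exemplar
```

Raising `s_star` makes each exemplar more expensive, so the best-scoring assignments use fewer clusters.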
AP: a message passing algorithm
Messages passed:
$$r(i,k) = S(x_i,x_k) - \max_{k' \neq k} \{ a(i,k') + S(x_i,x_{k'}) \}$$
$$r(k,k) = S(x_k,x_k) - \max_{k' \neq k} \{ S(x_k,x_{k'}) \}$$
$$a(i,k) = \min \Big\{ 0,\ r(k,k) + \sum_{i' \notin \{i,k\}} \max\{0, r(i',k)\} \Big\}$$
$$a(k,k) = \sum_{i' \neq k} \max\{0, r(i',k)\}$$
The index of the exemplar $\sigma(x_i)$ associated to $x_i$ is finally defined as:
$$\sigma(x_i) = \arg\max \{ r(i,k) + a(i,k),\ k = 1 \ldots N \}$$
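A minimal NumPy sketch of these updates follows. It is not the authors' implementation. It applies the uniform responsibility update of Frey & Dueck to every entry, including r(k,k), and adds damping to stabilise the iterations. The damping factor, the fixed iteration count, and the toy data are assumptions that do not appear on the slides.

```python
import numpy as np

def affinity_propagation(S, n_iter=200, damping=0.7):
    """Return, for each point i, the index of its exemplar sigma(x_i)."""
    N = S.shape[0]
    R = np.zeros((N, N))  # responsibilities r(i, k)
    A = np.zeros((N, N))  # availabilities a(i, k)
    rows = np.arange(N)
    for _ in range(n_iter):
        # r(i,k) = S(i,k) - max_{k' != k} [a(i,k') + S(i,k')]
        AS = A + S
        idx = AS.argmax(axis=1)
        first = AS[rows, idx]
        AS[rows, idx] = -np.inf
        second = AS.max(axis=1)
        max_other = np.repeat(first[:, None], N, axis=1)
        max_other[rows, idx] = second  # the best k is compared to the runner-up
        R = damping * R + (1 - damping) * (S - max_other)

        # a(i,k) = min{0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))}
        # a(k,k) = sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new
    return (A + R).argmax(axis=1)

# Toy data: two well-separated groups on a line, penalty s* = 1 (arbitrary).
X = np.array([0.0, 0.1, 0.2, 5.0, 5.1, 5.2])
S = -(X[:, None] - X[None, :]) ** 2
np.fill_diagonal(S, -1.0)
labels = affinity_propagation(S)
```

Each entry of `labels` is the index of an actual data point, which is the "real point as exemplar" property the slides ask for.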
Summary of AP
Affinity Propagation (AP)
A clustering method
Converges by iterations of message passing
No need for K (the number of clusters)
Real points as exemplars
An application of belief propagation (simplified graph + message passing)
Cons:
Computational complexity
Similarity computation: O(N^2)
Message passing: O(N^2 log N)
Hierarchical AP
Divide-and-conquer (inspired by Guha et al, TKDE 2003)
Weighted AP
AP → WAP
$x_i \to (x_i, n_i)$
$S(x_i, x_j) \to n_i \times S(x_i, x_j)$: price for $x_i$ to select $x_j$ as an exemplar
$S(x_i, x_i) \to S(x_i, x_i) + (n_i - 1) \times \epsilon$: price to select $x_i$ as an exemplar, where $\epsilon$ is the variance of the $n_i$ points
Proposition
WAP ≡ AP with duplications (aggregations)
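The WAP transformation can be sketched as a function from an AP similarity matrix to its weighted counterpart. The function name `wap_similarity`, the per-point array `eps`, and the toy numbers are illustrative assumptions. The sign of the ε term is copied from the slide as written.

```python
import numpy as np

def wap_similarity(S, n, eps):
    """Weighted AP similarities from plain AP similarities.

    S   : (N, N) similarity matrix, with the penalty on the diagonal
    n   : (N,) number of original points aggregated in each item
    eps : (N,) variance of the n_i points aggregated in item i
    """
    S = np.asarray(S, dtype=float)
    n = np.asarray(n, dtype=float)
    eps = np.asarray(eps, dtype=float)
    Sw = n[:, None] * S                                # S(x_i, x_j) -> n_i * S(x_i, x_j)
    np.fill_diagonal(Sw, np.diag(S) + (n - 1) * eps)   # S(x_i, x_i) + (n_i - 1) * eps
    return Sw

# Toy example: item 0 aggregates three points, item 1 is a single point.
S = np.array([[-2.0, -1.0],
              [-1.0, -2.0]])
Sw = wap_similarity(S, n=[3, 1], eps=[0.5, 0.0])
```

With every `n_i` equal to 1 the matrix is unchanged, so WAP reduces to plain AP on non-aggregated data.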
Hierarchical AP
Complexity of Hi-AP is $O(N^{3/2})$ (X. Zhang et al, ECML/PKDD 2008)
NB: can be iteratively reduced to $O(N^{1+\gamma})$ (X. Zhang et al, SIGKDD 2009)
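A rough cost count suggests where the exponent 3/2 can come from. This is a sketch under stated assumptions, not the derivation of the cited papers. It assumes the data are split into subsets of size $b$, that about $K$ exemplars are kept per subset, and that the log factor of message passing is ignored. The symbols $b$ and $K$ are illustrative and do not appear on the slides.

```latex
\begin{align*}
\text{AP on } N/b \text{ subsets of size } b &: \quad \frac{N}{b} \cdot O(b^2) = O(N b) \\
\text{WAP on the } NK/b \text{ exemplars} &: \quad O\!\left(\frac{N^2 K^2}{b^2}\right) \\
b = \sqrt{N} &: \quad O(N^{3/2}) + O(K^2 N) = O(N^{3/2}) \quad \text{for fixed } K
\end{align*}
```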
Validation of Hi-AP on EGEE jobs
EGEE (Enabling Grids for E-sciencE)
Grid Observatory: http://www.grid-observatory.org/
description of jobs (237,087)
4 numeric features: duration of execution
1 symbolic feature: name of queue
Evaluation: Distortion
$$D([\sigma]) = \sum_{i=1}^{N} d^2(x_i, \sigma(x_i))$$
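The distortion can be computed directly from the data and the exemplar assignment. The sketch below assumes Euclidean distance and an index array `sigma` giving each point's exemplar. Both the function name and the toy data are illustrative.

```python
import numpy as np

def distortion(X, sigma):
    """D([sigma]) = sum_i d^2(x_i, sigma(x_i)), with d assumed Euclidean."""
    X = np.asarray(X, dtype=float)
    sigma = np.asarray(sigma)
    diffs = X - X[sigma]          # each point minus its exemplar
    return float(np.sum(diffs ** 2))

# Toy example: three 2-D points, all assigned to point 0.
X = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]
D = distortion(X, [0, 0, 0])
```

Lower values mean that the exemplars summarise the data more tightly for a given number of clusters.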
[Figure: distortion (×10^5) versus number of clusters K (50 to 300) for hierarchical K-centers, Hi-AP simple, and Hi-AP, on 237,087 jobs. Run time: 10 min on an Intel 2.66 GHz dual-core PC with 2 GB memory.]
Hi-AP has the lowest distortion compared to baseline method
Challenges of Stream Clustering
Data stream:
a real-time, continuous, ordered sequence of items arriving at a very high speed (Golab & Ozsu, SigMod 2003)
e.g., network traffic data, sensor network monitoring data
Data streams clustering
Provide compact description of data flow
Incremental model updating
No specified number of clusters
Process in real-time
Available results at any time
Related works
Divide-and-conquer strategy (Guha et al, TKDE 2003): fixed segmentation window → cannot handle a changing distribution
A two-level scheme (Aggarwal et al, VLDB 2003)
online level: summarize the evolving data stream
offline level: generate the clusters from the summary
a clustering method is used to get the initial micro-clusters and the final clusters, e.g., the density-based clustering method DBSCAN (Cao et al, SDM 2006)
Problem: the online clustering model is either not provided or only available when requested by the user.
Stream clustering
[Figure: a stream of incoming items (e, i), the current Model, and the Reservoir]
Does $x_t$ fit the current model?
if yes, update the model
otherwise, go to reservoir
Has the distribution changed?
CHANGE TEST
if yes, rebuild the model
otherwise, continue
Xiangliang Zhang, Cyril Furtlehner, Michele Sebag Data Streaming for Autonomic Computing in the EGEE framew
![Page 40: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/40.jpg)
![Page 41: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/41.jpg)
StrAP Method
data stream → streaming process → system models $\{e_i, n_i, \Sigma_i, t_i\}$
Does $x_t$ fit the current model?
if yes, update the model: update the weight with time decay (decay window ∆)
otherwise, go to the reservoir
Has the distribution changed?
if yes, rebuild the model from the current model and the reservoir by WAP
otherwise, continue
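The fit-or-reservoir step above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the fit test (Euclidean distance to the nearest exemplar at most `eps`), the exact form of the time-decayed weight update, and the names `strap_step`, `eps` and `decay_window` are hypothetical and not taken from the slides. The distortion term $\Sigma_i$ and the WAP rebuild are left out.

```python
import math

def strap_step(model, reservoir, x, t, eps, decay_window):
    """Process one item x arriving at time t.

    model: list of dicts with keys "e" (exemplar), "n" (weight), "t" (last update time).
    Returns True if x was absorbed by the model, False if it went to the reservoir.
    """
    if model:
        # Nearest exemplar under Euclidean distance (assumed dissimilarity).
        nearest = min(model, key=lambda c: math.dist(c["e"], x))
        if math.dist(nearest["e"], x) <= eps:
            # Assumed decay rule: older weights shrink with the elapsed time,
            # then the new item adds one unit of weight.
            elapsed = t - nearest["t"]
            nearest["n"] = nearest["n"] * decay_window / (decay_window + elapsed) + 1
            nearest["t"] = t
            return True
    # x does not fit any exemplar: keep it for a later rebuild.
    reservoir.append(x)
    return False
```

A caller would loop over the stream, call `strap_step` for each item, and trigger the rebuild when the reservoir fills up or when the change test fires.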
![Page 42: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/42.jpg)
When to rebuild the model?
when the reservoir is full
when a change is detected: Page-Hinkley statistic (a cumulative-sum-like test)
(Page, Biometrika 1954; Hinkley, Biometrika 1971)
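[Figure: $p_t$, $\bar{p}_t$, $m_t$ and $M_t$ plotted against time $t$ (0 to 1000) for a changing distribution of $p_t$]
$$\bar{p}_t = \frac{1}{t}\sum_{\ell=1}^{t} p_\ell$$
$$m_t = \sum_{\ell=1}^{t}\left(p_\ell - \bar{p}_\ell + \delta\right)$$
$$M_t = \max_{\ell \le t}\{m_\ell\}$$
$$PH_t = M_t - m_t$$
if $PH_t > \lambda$, a change is detected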
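As a minimal sketch, the Page-Hinkley statistic can be computed incrementally in one pass over the observations $p_t$. The function name `page_hinkley` and the default values of `delta` and `lam` are illustrative assumptions, not values from the slides. With the formulas as written, $PH_t$ grows when $p_t$ falls below its running mean by more than $\delta$.

```python
def page_hinkley(stream, delta=0.01, lam=5.0):
    """Return the first time step t (1-based) with PH_t > lam, or None."""
    total = 0.0          # running sum of p_1 .. p_t
    m = 0.0              # m_t: cumulative sum of (p_l - mean_l + delta)
    big_m = float("-inf")  # M_t: maximum of m_1 .. m_t
    for t, p in enumerate(stream, start=1):
        total += p
        mean = total / t             # running mean of p up to t
        m += p - mean + delta
        big_m = max(big_m, m)
        if big_m - m > lam:          # PH_t = M_t - m_t
            return t
    return None
```

For a stream that stays at 1.0 for 100 steps and then drops to 0.0, the test fires a few steps after the drop; on a constant stream it never fires.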
How to set λ?
![Page 43: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/43.jpg)
Setting of λ
fixed empirical value (X. Zhang et al., ECML/PKDD 2008)
self-adaptive change detection test (X. Zhang et al., SIGKDD 2009)
Self-adapting λ ≡ an optimization problem
BIC:
$$F_\lambda = \frac{1}{|C|}\sum_{i=1}^{|C|}\left(\frac{1}{n_i}\sum_{e_j \in C_i} d(e_j, e_i^*)\right) + \phi\,\frac{\rho}{2}\log N + \eta\,O_t$$
∝ loss + size of model + percentage of outliers
OPTIMIZATION:
ε-greedy search from a finite set of λ values
$$\lambda^* = \arg\min_{\lambda}\, \mathbb{E}(F_\lambda)$$
λ1        λ2        λ3        λ4        ...
E(Fλ1)    E(Fλ2)    E(Fλ3)    E(Fλ4)    ...
Gaussian Process regression based on $\{\lambda_i, F_{\lambda_i}\}$
a continuous value of λ is generated
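The ε-greedy search over a finite set of λ values can be sketched as follows. This is an illustration only: `evaluate` is a hypothetical callback standing in for one observation of $F_\lambda$ after running the stream with that threshold, and `rounds`, `eps` and `seed` are assumed parameters. The Gaussian Process regression variant is not covered here.

```python
import random

def epsilon_greedy_lambda(candidates, evaluate, rounds=200, eps=0.1, seed=0):
    """Pick the candidate lambda with the lowest empirical mean of F_lambda."""
    rng = random.Random(seed)
    sums = {lam: 0.0 for lam in candidates}
    counts = {lam: 0 for lam in candidates}

    def mean(lam):
        return sums[lam] / counts[lam]

    # Try every candidate once so that each empirical mean is defined.
    for lam in candidates:
        sums[lam] += evaluate(lam)
        counts[lam] += 1

    for _ in range(rounds):
        if rng.random() < eps:
            lam = rng.choice(candidates)       # explore
        else:
            lam = min(candidates, key=mean)    # exploit the current best
        sums[lam] += evaluate(lam)
        counts[lam] += 1

    return min(candidates, key=mean)
```

With probability `eps` a random threshold is tried, otherwise the one with the lowest mean cost so far, so the estimate of $\mathbb{E}(F_\lambda)$ keeps improving for the values that matter most.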
![Page 44: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/44.jpg)
![Page 45: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/45.jpg)
Validation of StrAP on KDD99 data
Data used
Real world data: KDD99 data
intrusion detection benchmark
494,021 network connection records in $\mathbb{R}^{34}$
23 classes: 1 normal + 22 attacks
Baseline: DenStream (Cao et al, SDM2006)
Performance indicator (supervised setting)
Clustering accuracy
Clustering purity
KDD Cup 1999 data: http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html.
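To make the two performance indicators concrete, here is a small sketch of how they could be computed in the supervised setting. The definitions are assumptions, since the slide does not spell them out: purity is taken as the average, over clusters, of the share of the majority class, and accuracy as the share of items whose class matches the majority class of their cluster. The function name `purity_and_accuracy` is hypothetical.

```python
from collections import Counter, defaultdict

def purity_and_accuracy(cluster_ids, labels):
    """Return (purity %, accuracy %) for a clustering with known class labels."""
    members = defaultdict(list)
    for cluster, label in zip(cluster_ids, labels):
        members[cluster].append(label)

    # Size of the majority class and total size of each cluster.
    majority = [Counter(ys).most_common(1)[0][1] for ys in members.values()]
    sizes = [len(ys) for ys in members.values()]

    purity = 100.0 * sum(m / s for m, s in zip(majority, sizes)) / len(sizes)
    accuracy = 100.0 * sum(majority) / sum(sizes)
    return purity, accuracy
```

In an exemplar-based method the class of a cluster would more naturally be the class of its exemplar, which can differ from the majority class used in this sketch.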
![Page 46: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/46.jpg)
Accuracy and Purity along time
Error rate along time < 2%
[Figure: error rate (%) versus time steps (0 to 5 × 10^5), with restart points marked]
Higher clustering purity than DenStream
[Figure: cluster purity (%) over time windows 1 to 4 for STRAP ∆=15000, STRAP ∆=5000 and DenStream]
![Page 47: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/47.jpg)
Discussion
StrAP vs DenStream
Pros
better accuracy
True detection rate: 99.18%
False alarm rate: 1.39%
Online error rate < 2%
model available at any time
Cons
DenStream: 7 seconds
StrAP: 7 minutes
![Page 48: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/48.jpg)
![Page 49: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/49.jpg)
Multi-scale Realtime Grid Monitoring System
![Page 50: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/50.jpg)
Multi-scale Realtime Grid Monitoring System
[Figure: bar chart of the percentage of jobs assigned (%) to each cluster and to the outliers; each cluster is labeled with its exemplar, shown as a job vector]
![Page 51: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/51.jpg)
Multi-scale Realtime Grid Monitoring System
[Figure: percentage of jobs (%) per day, days 0 to 160; distribution of jobs in cluster [7 0 0 0 0 0]]
[Figure: percentage of jobs (%) per day, days 0 to 160; distribution of jobs in cluster [0 0 0 0 0 0]]
![Page 52: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/52.jpg)
Experimental Data
EGEE logs of 39 RBs during 5 months (2006-01-01 ∼ 2006-05-31)
5,268,564 jobs
for each job, its
final status (good or type of error)
6 features describing the time-cost of services in a job lifecycle
![Page 53: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/53.jpg)
Experimental Results: Online Monitoring outputs
![Page 54: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/54.jpg)
Real-time Monitoring: when a change is detected
Online summary of the streaming jobs into clusters:
[Figure: bar chart of the percentage of jobs assigned (%) to each cluster and to the reservoir; each cluster is labeled with its exemplar, shown as a job vector]
![Page 55: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/55.jpg)
Real-time Monitoring: when a change is detected
Online summary of the streaming jobs into clusters:
[Figure: bar chart of the jobs assigned to each cluster and to the reservoir, with exemplars shown as job vectors; annotation: LogMonitor is getting clogged]
![Page 56: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/56.jpg)
Clustering Accuracy
[Figure: accuracy (%) versus time step (0 to 5 × 10^6) for StrAP with PH $\lambda_t$ and for streaming k-centers]
10% higher than the baseline method (streaming k-centers)
![Page 57: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/57.jpg)
Discussion
Real-time quality (330K jobs/day):
tested on an Intel 2.66 GHz Dual-Core PC with 2 GB memory
10k jobs per minute when coded in Matlab
60k jobs per minute when coded in C/C++
concise online summary of the streaming jobs, with
proportion of defects
performance of the grid services
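A quick back-of-the-envelope check of the real-time claim, assuming (for illustration only) that the 330K jobs arrive uniformly over the day: the arrival rate is then about 229 jobs per minute, well below both reported throughputs. The helper name `headroom` is hypothetical.

```python
JOBS_PER_DAY = 330_000
MINUTES_PER_DAY = 24 * 60

def headroom(throughput_per_minute, jobs_per_day=JOBS_PER_DAY):
    """Ratio of processing throughput to the average arrival rate."""
    arrival_per_minute = jobs_per_day / MINUTES_PER_DAY
    return throughput_per_minute / arrival_per_minute

arrival = JOBS_PER_DAY / MINUTES_PER_DAY
print(f"arrival rate: {arrival:.0f} jobs per minute")
print(f"Matlab headroom: {headroom(10_000):.1f}x")
print(f"C/C++ headroom:  {headroom(60_000):.1f}x")
```

Bursts in the job flow would reduce this margin, so the ratios are an average-case indication only.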
![Page 58: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/58.jpg)
Experimental Results: Offline Analysis
![Page 59: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/59.jpg)
Large-time scale Monitoring: Global view
the historical behavior of interesting exemplars
without prior knowledge about failure patterns
summarizing gigabytes of data
![Page 60: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/60.jpg)
Bad Super Exemplars: day view
[Figure: heat map of super clusters (1 to 20) against days (20 to 140), color scale 0 to 90%]
“early stopped error”: who and when?
Date     Jan 7∼13    Jan 30 ∼ Feb 3    Mar 16∼21    May 17∼19
UserID   A1          A1                B1           D1 and A1
![Page 61: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/61.jpg)
Discussion and Conclusion
real-time monitoring of Grid job streams
providing multi-scale models to describe the status of the Grid
proportion of different types of job patterns (realtime-view, day-view, week-view ...)
rupture steps
offline global analysis
good-quality clustering is guaranteed
![Page 62: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/62.jpg)
![Page 63: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/63.jpg)
Conclusion, Algorithm
Scalability: Hi-AP
Reduce complexity from $O(N^2)$ to $O(N^{3/2})$
Iteratively reduce toward $O(N^{1+\gamma})$
Stream clustering: StrAP
Framework for processing streaming data
Hybridized with an efficient change detection method, Page-Hinkley
Model available at any time
BUT: slower than DenStream
![Page 64: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/64.jpg)
Conclusion, Application
Network Intrusion Detection (KDD99 data)
clustering by one scan of the data
using only < 1% of the data to build the model (active learning)
high clustering and classification accuracy
Autonomic Grid Computing
real-time grid monitoring system
visualized online output describing grid running status
offline output for historical performance analysis
multi-scale analysis of system behaviors
![Page 65: Data Streaming for Autonomic Computing in the EGEE frameworksebag/Slides/seminar_XZ_june.pdf · 2010-06-14 · real data points as representative exemplars (cluster center): suit](https://reader033.fdocuments.in/reader033/viewer/2022042007/5e707396c2328412203d6ab5/html5/thumbnails/65.jpg)
Ongoing work
Flexible Clustering Methods
Fixed number of clusters by message passing
Arbitrary-shape clusters by message passing
Comprehensive model of streaming data
using several representative exemplars covering the cluster, instead of one center point
Online Learning
Assess the alarm level attached to a given model
criticality of a cluster based on its frequency over time
User profiling
the clusters → new features → describe the users (viewing a user as a set of clusters)