4/8/2002 Copyright Daniel Barbara
Clustering by impact
Daniel Barbará
George Mason University
ISE Dept.
http://www.ise.gmu.edu/~dbarbara
(joint work with P. Chen, J. Couto, and Y. Li)
Problem

Organizations are constantly acquiring and storing new data (data streams). The need to quickly extract knowledge from the newly arrived data (and compare it with the old) is pressing.

Applications:
Intrusion detection
Tuning
Intelligence analysis
Outline
Clustering data streams
Our method
Continuous data: Fractal Clustering
Categorical (nominal) data: Entropy-based
Tracking clusters
Future work
Clustering and data streams
To cluster continuously arriving data streams, a clustering algorithm should behave incrementally: make the decision based on the newly arrived point and a concise description of the clusters encountered so far.
Concise = a bounded amount of RAM to describe the clusters, independent of the number of data points processed so far.
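The incremental requirement can be sketched in code. This is a minimal illustration, not the talk's algorithm: names like `ClusterSummary` and `assign` are hypothetical, and a centroid summary stands in for whatever concise description an algorithm keeps. The point is that memory depends only on the number of clusters and dimensions, never on the number of points seen.

```python
class ClusterSummary:
    """Concise cluster description: a count and per-coordinate sums.
    Memory is O(dimensions) per cluster, independent of points processed."""
    def __init__(self, point):
        self.n = 1
        self.sums = list(point)

    def centroid(self):
        return [s / self.n for s in self.sums]

    def add(self, point):
        self.n += 1
        for i, x in enumerate(point):
            self.sums[i] += x

def assign(point, clusters):
    """Place the new point in the cluster with the nearest centroid."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(point, c.centroid()))
    min(clusters, key=dist2).add(point)

clusters = [ClusterSummary([0.0, 0.0]), ClusterSummary([10.0, 10.0])]
for p in [[0.5, 0.2], [9.8, 10.1], [0.1, -0.3]]:
    assign(p, clusters)
```

Each decision uses only the new point and the summaries, so the stream can be processed in one pass.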
Problem (cont.)
Most algorithms in the literature do not have that property:
They look at the entire set of points at once (e.g., K-means)
They cannot make decisions point by point.
The description of the clusters is usually the set of points in them.
Some of the algorithms have high complexity
Some inroads

Paper by P. Bradley, U. Fayyad, and C. Reina: “Scaling Clustering Algorithms to Large Databases” (KDD’98).
Main idea: keep descriptions of centroids, plus descriptions of the sets of points that are likely and unlikely to change assignment given a new data point.
Papers by Motwani et al.: incrementally update centroids while receiving a data stream. The goal is an approximation to “min squares” whose performance is bounded.
Our proposal
Find functions that naturally define clusters and that can be easily computed given a new point and a concise representation of the current clusters.
Place a new point in the cluster for which the evaluated function shows a minimum (or a maximum), i.e., where the point has the least impact.
“Impact” functions
Numerical data points: fractal dimension.
Measures the self-similarity of the points. The idea: the lower the change in the fractal dimension when the point is included, the more self-similar the point is with respect to the cluster.
Categorical data points: entropy.
Also measures similarity; lower entropy means more similar points.
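The "place where the impact is least" idea can be shown with a toy stand-in. The talk's impact functions are the fractal dimension (numeric data) and entropy (categorical data); here variance is used instead so the sketch stays short, and the names `impact` and `place` are illustrative, not from the talk.

```python
def impact(cluster):
    # Toy impact function (variance); the talk's real functions are the
    # fractal dimension for numeric data and entropy for categorical data.
    n = len(cluster)
    mean = sum(cluster) / n
    return sum((x - mean) ** 2 for x in cluster) / n

def place(point, clusters):
    """Put the point where its inclusion changes the impact function least."""
    def change(k):
        return abs(impact(clusters[k] + [point]) - impact(clusters[k]))
    best = min(range(len(clusters)), key=change)
    clusters[best].append(point)
    return best

clusters = [[1.0, 1.1, 0.9], [10.0, 10.2, 9.8]]
idx = place(1.05, clusters)
```

The point 1.05 barely changes the variance of the first cluster but would change the second one drastically, so it lands in the first.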
Fractal Clustering

The fractal dimension is a (not necessarily integer) number that characterizes the number of dimensions “filled” by the object represented by the dataset. For example, the Menger sponge (when complete) has a fractal dimension of about 2.73, less than the embedding space, whose dimension is 3.

Conjecture: if part of a dataset brings about a change in the overall fractal dimension of the set, then this part is “anomalous” (exhibits different behavior) with respect to the rest of the dataset.
Fractal dimension

With grid cells of size r, let p_i be the fraction of points that fall in cell i (a probability distribution). The generalized fractal dimension is

$$
D_q =
\begin{cases}
\dfrac{\partial \sum_i p_i \log p_i}{\partial \log r} & \text{for } q = 1 \\[6pt]
\dfrac{1}{q-1}\, \dfrac{\partial \log \sum_i p_i^{\,q}}{\partial \log r} & \text{otherwise}
\end{cases}
$$
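In practice the fractal dimension is estimated numerically by counting occupied grid cells at different scales. A minimal sketch, assuming a two-scale finite-difference estimate of the box-counting dimension D0 (the function name is illustrative):

```python
import math

def box_count_dimension(points, r1, r2):
    """Estimate the box-counting dimension from two grid sizes:
    D0 ~ -(log N(r2) - log N(r1)) / (log r2 - log r1),
    where N(r) is the number of occupied grid cells of size r."""
    def occupied(r):
        return len({tuple(math.floor(x / r) for x in p) for p in points})
    n1, n2 = occupied(r1), occupied(r2)
    return -(math.log(n2) - math.log(n1)) / (math.log(r2) - math.log(r1))

# Points along a diagonal line segment should fill about one dimension.
line = [(i / 1000.0, i / 1000.0) for i in range(1000)]
d = box_count_dimension(line, 0.1, 0.01)
```

A production estimator would fit a regression line over many scales rather than use just two.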
Box Counting: the Cantor Dust Set
Box counting (cont.)

For the Cantor set, each refinement of the grid by a factor of 3 doubles the occupied population:

| population p | grid size r |
| --- | --- |
| 2 | r0 |
| 4 | r0/3 |
| 8 | r0/9 |

$$
D = -\lim_{n \to \infty} \frac{\log 2^n}{\log (r_0/3^n)} = \frac{\log 2}{\log 3} \approx 0.63
$$

[Plot: population vs. grid size on log-log axes; the slope of the line is the fractal dimension]
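The doubling pattern above can be checked exactly with integer arithmetic (a sketch; `cantor_cells` is an illustrative name):

```python
import math

def cantor_cells(level):
    # Grid cells of size (1/3)**level occupied by the Cantor set: exactly
    # the cell indices whose base-3 digits are all 0 or 2.
    cells = [0]
    for _ in range(level):
        cells = [3 * c + digit for c in cells for digit in (0, 2)]
    return cells

n = 8
occupied = len(set(cantor_cells(n)))        # 2**n boxes at grid size (1/3)**n
d0 = math.log(occupied) / math.log(3 ** n)  # ~ log 2 / log 3
```

At every level the population doubles while the grid shrinks by 3, so the slope converges to log 2 / log 3 ≈ 0.63.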
Initialization Algorithm
Take an unlabelled point in the sample and start a cluster.
Find close neighbors and add them to the cluster.
Find close neighbors of the points already in the cluster, and so on. When no more neighbors can be added, go back to the first step.
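A sketch of this neighbor-growing initialization, assuming a user-supplied closeness predicate (all names are illustrative, and the quadratic scan is for clarity, not efficiency):

```python
def initialize_clusters(sample, close):
    """Greedy initialization: start a cluster from an unlabelled point,
    repeatedly absorb close neighbours of cluster members, and start a
    new cluster when the current one stops growing."""
    labels = {}
    k = 0
    for i in range(len(sample)):
        if i in labels:
            continue
        labels[i] = k
        frontier = [i]
        while frontier:
            j = frontier.pop()
            for m in range(len(sample)):
                if m not in labels and close(sample[j], sample[m]):
                    labels[m] = k
                    frontier.append(m)
        k += 1
    return labels

sample = [(0.0,), (0.1,), (0.2,), (5.0,), (5.1,)]
labels = initialize_clusters(sample, lambda a, b: abs(a[0] - b[0]) < 0.5)
```

The two well-separated groups in the sample end up in two clusters.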
Space management
Space in RAM is not proportional to the size of the dataset, but rather to the size of the grid and number of grid levels kept.
These vary with:
Dimensionality
Accuracy (odd-shaped clusters may require more levels).
Experiments
Dataset1
Scalability results with Dataset1
[Plot: execution time (seconds) vs. number of points]
Quality of clusters (Dataset1)
[Plot: percentage of points clustered right (clusters C1, C2, C3) vs. dataset size]
High dimensional set
10 dimensions, 2 clusters. Percentage of points clustered right:

| Cluster | % right |
| --- | --- |
| C1 | 94.3 |
| C2 | 100 |
Results with the noisy dataset
92% of the noise gets filtered out. Percentage of points clustered right:

| Cluster | % right |
| --- | --- |
| C1 | 99.57 |
| C2 | 100 |
| C3 | 83.62 |
Memory usage vs. dimensions

[Plot: memory used (KB) vs. number of dimensions, 1 to 10]
Memory reduction
Space taken by the boxes is small, but it grows with the number of dimensions.
Memory reduction techniques:
1. Use only boxes with # points > epsilon.
2. Cache boxes.
3. Keep only the smallest-granularity boxes and derive the rest.

None of them causes a significant degradation of quality (techniques 2 and 3 have an impact on running time).
Memory reduction
| Technique | % memory reduction |
| --- | --- |
| 1 | 19 |
| 2 | 25 |
| 3 | 75 |
| 4 | 55 |
Comparison with other algorithms
[Bar chart: percentage of points clustered right (clusters C1, C2) and outliers, for algorithms CURE and FC]
Entropy-based Clustering (COOLCAT)
For categorical data.
Place a new point where it minimizes some function of the entropies of the individual clusters (e.g., min(max(entropy(C_i)))).
Heuristic (the problem is NP-hard).
Minimize the expected entropy; the entropy of each cluster is

$$
E(C_k) = -\sum_{i=1,\dots,d} \sum_{j} P(V_{ij} \mid C_k) \log P(V_{ij} \mid C_k)
$$
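The cluster entropy formula can be evaluated directly from a cluster's points, assuming the attribute-independence convention above (the formula sums over attributes i and values j). Names are illustrative and base-2 logs are used:

```python
import math
from collections import Counter

def cluster_entropy(points):
    """E(C_k) = -sum over attributes i and values j of
    P(V_ij | C_k) * log P(V_ij | C_k)."""
    n = len(points)
    e = 0.0
    for i in range(len(points[0])):                 # each attribute
        for c in Counter(p[i] for p in points).values():
            prob = c / n
            e -= prob * math.log2(prob)
    return e

uniform = [("a", "x"), ("b", "y")]   # maximally mixed: 1 bit per attribute
pure = [("a", "x"), ("a", "x")]      # identical points: zero entropy
```

Lower entropy means the cluster's points agree on more attribute values.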
Initialization
Need to seed “k” clusters:
Select a sample.
Find 2 points that are the most dissimilar (their joint entropy is the highest).
Place them in 2 different clusters
Find another point that is the most dissimilar (pairwise) to the ones selected, and start another cluster.
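A sketch of the seeding procedure, assuming pairwise joint entropy is computed attribute by attribute; `joint_entropy` and `seed` are illustrative names, and the exhaustive pair search stands in for whatever sampling the real algorithm uses:

```python
import math
from collections import Counter
from itertools import combinations

def joint_entropy(p, q):
    """Joint entropy of a pair of categorical points, attribute by attribute."""
    e = 0.0
    for a, b in zip(p, q):
        for c in Counter([a, b]).values():
            e -= (c / 2) * math.log2(c / 2)
    return e

def seed(sample, k):
    """Pick the most dissimilar pair first, then repeatedly add the point
    whose minimum pairwise joint entropy to the chosen seeds is largest."""
    first = max(combinations(range(len(sample)), 2),
                key=lambda ij: joint_entropy(sample[ij[0]], sample[ij[1]]))
    seeds = list(first)
    while len(seeds) < k:
        rest = [i for i in range(len(sample)) if i not in seeds]
        seeds.append(max(rest, key=lambda i: min(
            joint_entropy(sample[i], sample[j]) for j in seeds)))
    return seeds

sample = [("a", "x"), ("a", "x"), ("b", "y"), ("c", "z")]
chosen = seed(sample, 3)
```

Identical points have joint entropy 0, so the seeds end up pairwise dissimilar.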
Incremental phase
For a given point and k current clusters:
Compute the expected entropy as the new point is tentatively placed in each cluster.
Choose the placement that minimizes the expected entropy.
After finishing with a batch of points, re-process m% of them (take the “worst” fits out and re-cluster them): this helps with the issue of order dependency.
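The incremental step can be sketched as follows. This naive version recomputes entropies from raw points for clarity; the actual algorithm would work from the concise count summaries, and all names here are illustrative:

```python
import math
from collections import Counter

def entropy(points):
    n = len(points)
    e = 0.0
    for i in range(len(points[0])):
        for c in Counter(p[i] for p in points).values():
            e -= (c / n) * math.log2(c / n)
    return e

def expected_entropy(clusters):
    total = sum(len(c) for c in clusters)
    return sum(len(c) / total * entropy(c) for c in clusters)

def coolcat_step(point, clusters):
    """Tentatively place the point in each cluster and keep the placement
    that minimizes the expected entropy of the whole clustering."""
    def trial_entropy(k):
        return expected_entropy(
            [c + [point] if i == k else c for i, c in enumerate(clusters)])
    best = min(range(len(clusters)), key=trial_entropy)
    clusters[best].append(point)
    return best

clusters = [[("a", "x"), ("a", "x")], [("b", "y"), ("b", "y")]]
k = coolcat_step(("a", "x"), clusters)
```

A point that matches a cluster's values leaves its entropy at zero, so that placement wins.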
Conciseness
Notice that the current cluster description is concise:
Counts of V_ij for every i = 1, …, d (the number of attributes), and for every j (the values in the domain of each attribute).
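A sketch of such a concise description: only the per-attribute value counts are stored, and the cluster entropy is computed from them, so memory is bounded by the sum of the attribute domain sizes rather than the number of points (class and method names are illustrative):

```python
import math
from collections import defaultdict

class CategoricalSummary:
    """Concise cluster description: per-attribute value counts only."""
    def __init__(self, d):
        self.n = 0
        self.counts = [defaultdict(int) for _ in range(d)]

    def add(self, point):
        self.n += 1
        for i, v in enumerate(point):
            self.counts[i][v] += 1

    def entropy(self):
        e = 0.0
        for attr in self.counts:
            for c in attr.values():
                e -= (c / self.n) * math.log2(c / self.n)
        return e

s = CategoricalSummary(2)
for p in [("a", "x"), ("a", "y")]:
    s.add(p)
```

Here the first attribute contributes 0 bits (all “a”) and the second contributes 1 bit.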
COOLCAT and the MDL
MDL = minimum description length.
Widely used to argue about how good a classifier is: how many bits does it take to send a receiver the description of your classifier plus the exceptions (misclassifications)?
MDL (cont.)

$$
K(h, D) = K(h) + K(D \text{ using } h)
$$

where h = model and D = data. For a clustering C into k clusters:

$$
K(h) = \log\left(k^{|D|}\right), \qquad K(h, D) = K(h) + \bar{E}(C)
$$

(the cluster labels of the points describe the model, and the expected entropy of the clustering C describes the data given the model).
Experimental results
Real and synthetic datasets
Evaluate quality and performance
Quality: category utility function (how much “better” the probability distribution in the individual clusters is with respect to the original distribution).
External entropy: take an attribute not used in the clustering, compute the entropy of each cluster with respect to it, then take the expected external entropy.
Experimental results: archaeological data set

| Alg. | m | CU | Ext. E | Exp. E |
| --- | --- | --- | --- | --- |
| COOLCAT | 0 | 0.7626 | 0 | 4.8599 |
| COOLCAT | 10 | 0.7626 | 0 | 4.8599 |
| COOLCAT | 20 | 0.7626 | 0 | 4.8599 |
| Brute force | - | 0.7626 | 0 | 4.8599 |
| ROCK | - | 0.3312 | 0.96 | - |
KDD99 Cup data (intrusion detection)

[Plot: expected entropy, CU, and external entropy vs. number of clusters k]
Performance (synthetic data)

[Plot: running time T (sec.) vs. dataset size N × 1000]
Tracking clusters

Clustering data streams as they come:
Consider the r.v. X = 0 if the new point is an outlier; X = 1 otherwise.
Using Chernoff bounds: we must see s “successes” (non-outliers) in a window of w points.
If you don’t, it is time for new clusters…
$$
s \ge \frac{3(1+\epsilon)}{\epsilon^2} \ln\!\left(\frac{2}{\delta}\right),
\qquad
w \ge \frac{2(1+\epsilon)}{(1-\epsilon)\,\epsilon^2\, p} \ln\!\left(\frac{2}{\delta}\right)
$$
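As an illustrative calculation of how such a window is sized: the sketch below uses the standard multiplicative Chernoff lower-tail bound (not necessarily the exact constants of the talk), where p is the probability a point is a non-outlier, eps the tolerated slack, and delta the allowed failure probability. All names and parameter values are assumptions.

```python
import math

def min_window(p, eps, delta):
    """Multiplicative Chernoff lower tail: for X ~ Binomial(w, p),
    P[X < (1 - eps) * w * p] <= exp(-w * p * eps**2 / 2).
    Forcing that failure probability below delta gives the window size."""
    return math.ceil(2 * math.log(1 / delta) / (p * eps ** 2))

p, eps, delta = 0.9, 0.1, 0.05          # illustrative parameter values
w = min_window(p, eps, delta)           # points to watch
s = math.floor((1 - eps) * w * p)       # non-outliers we must see among them
```

If fewer than s of the w points fit the current clusters, the stream has likely drifted and new clusters are needed.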
FC, COOLCAT and Tracking
Find a good definition of outlier:
FC: the minimum change in fractal dimension exceeds a threshold.
COOLCAT: based on the mutual information of the new point with respect to the clusters.
One tracking experiment with FC
One tracking experiment with COOLCAT (intrusion detection)

[Plot: density of the mutual information, with attacks vs. no attacks]
Future work

Hierarchical clustering
More tracking experiments
Hybrid data: numeric and categorical
Indexing based on clustering
…
Bibliography

- D. Barbará, P. Chen. “Using the Fractal Dimension to Cluster Datasets.” Proceedings of the ACM-SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, August 2000.
- D. Barbará, P. Chen. “Tracking Clusters in Evolving Data Sets.” Proceedings of FLAIRS’2001, Special Track on Knowledge Discovery and Data Mining, Key West, FL, May 2001.
- D. Menascé, V. Almeida, D. Barbará, B. Abrahão, F. Ribeiro. “Fractal Characterization of Web Workloads.” Proceedings of the 11th International World Wide Web Conference, May 2002.
- D. Barbará, P. Chen. “Using Self-Similarity to Cluster Large Data Sets.” To appear in Journal of Data Mining and Knowledge Discovery, Kluwer Academic Publishers.
- D. Barbará. “Requirements for Clustering Data Streams.” SIGKDD Explorations (Special Issue on Online, Interactive, and Anytime Data Mining), Vol. 3, No. 2, Jan. 2002.
- D. Barbará, J. Couto, Y. Li. “COOLCAT: An Entropy-Based Algorithm for Categorical Clustering.” Submitted for publication.