1
Very Large-Scale Incremental Clustering
Berk Berker, Mumin Cebe, Ismet Zeki Yalniz
27 March 2007
2
Table of Contents
Why Clustering?
Why Incremental Clustering?
Related Work
Incremental C3M (C2ICM)
A Former Implementation of C2ICM for Very Large Datasets
Conclusion
3
Why Clustering?
It is an effective tool to manage information overload
To browse large document collections quickly
To easily grasp the distinct topics and subtopics (concept hierarchies)
To allow search engines to efficiently query large document collections
4
Types of Clustering
Hierarchical vs. Non-hierarchical
Partitional vs. Agglomerative
Deterministic vs. Probabilistic algorithms
Incremental vs. Batch algorithms
5
Why Incremental Clustering?
The current information explosion
Popular sources of informational text documents such as Newswire and Blogs
Delay would be unacceptable in several important areas
6
Related Work
The cluster-splitting approach
Adaptive clustering based on user queries
Cobweb algorithm
Hierarchical clustering in an incremental manner
7
C2ICM Algorithm
C3M is known as an efficient, effective and robust algorithm for clustering documents
C3M is well-developed for initial clustering, but maintenance is also necessary in clustering
8
C2ICM Algorithm
C2ICM is based on the cover coefficient concept, as is C3M.
C2ICM is suitable for dynamic environments where documents are added and deleted
With C2ICM, reclustering the whole collection for each update is avoided.
9
C2ICM Algorithm Details
First we compute the number of clusters and cluster seed powers in the updated database
Then we determine the newly added documents and the documents of falsified clusters
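The first step above can be sketched in code. This is a minimal illustration, not the authors' implementation: the slides do not give the formulas, so the sketch assumes the cover coefficient definitions from the C3M literature as we understand them (c_ij = alpha_i * sum_k d_ik * beta_k * d_jk, decoupling coefficient delta_i = c_ii, number of clusters = sum of delta_i, seed power p_i = delta_i * (1 - delta_i) * row sum of d_i). The function names and the toy matrix are hypothetical, and ties between equally powerful seeds are not handled.

```python
def cover_coefficients(D):
    """Cover coefficient matrix C for a document-term matrix D (list of rows).

    Assumes every document has at least one term and every term occurs
    in at least one document, so no row or column sum is zero.
    """
    row_sums = [sum(row) for row in D]
    col_sums = [sum(col) for col in zip(*D)]
    m, n = len(D), len(col_sums)
    return [[sum(D[i][k] * D[j][k] / col_sums[k] for k in range(n)) / row_sums[i]
             for j in range(m)]
            for i in range(m)]


def select_seeds(D):
    """Return (number of clusters, seed powers, indices of the seed documents)."""
    C = cover_coefficients(D)
    delta = [C[i][i] for i in range(len(D))]        # decoupling coefficients
    n_c = max(1, round(sum(delta)))                 # estimated number of clusters
    power = [delta[i] * (1.0 - delta[i]) * sum(D[i]) for i in range(len(D))]
    ranked = sorted(range(len(D)), key=lambda i: -power[i])
    return n_c, power, sorted(ranked[:n_c])


# Hypothetical toy collection: documents 0-1 share terms 0-2, documents 2-3 share terms 3-5.
D = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
]
n_c, power, seeds = select_seeds(D)
```

In an incremental setting the same computation would be run on the updated document-term matrix, and the resulting seed list compared with the old one.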
10
C2ICM Algorithm Details
How do clusters become false?
When a seed document becomes a non-seed or is deleted
When one or more non-seed documents of a cluster become seeds
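The two falsification conditions can be expressed as a short check. The following is a sketch under our own assumptions: clusters are represented as a dictionary from the old seed to the set of its members, and the function names are hypothetical. The data reproduces the example used later in these slides (old seeds d1, d6, d12).

```python
def find_falsified(clusters, new_seeds, deleted=frozenset()):
    """Return the old seeds whose clusters are falsified by the update."""
    falsified = []
    for seed, members in clusters.items():
        # Condition 1: the seed is deleted or is no longer a seed.
        seed_lost = seed in deleted or seed not in new_seeds
        # Condition 2: some non-seed member of the cluster becomes a seed.
        member_promoted = any(d in new_seeds for d in members if d != seed)
        if seed_lost or member_promoted:
            falsified.append(seed)
    return falsified


def documents_to_cluster(clusters, new_seeds, new_docs, deleted=frozenset()):
    """Members of falsified clusters plus new documents, minus deleted ones."""
    falsified = find_falsified(clusters, new_seeds, deleted)
    pending = set(new_docs)
    for seed in falsified:
        pending |= clusters[seed]
    return falsified, pending - set(deleted)


clusters = {
    "d1": {"d1", "d2", "d3", "d4", "d5", "d7"},
    "d6": {"d6", "d8", "d9", "d10", "d11", "d15"},
    "d12": {"d12", "d13", "d14", "d16", "d17", "d18"},
}
new_docs = {"d19", "d20", "d21", "d22"}

# Case 1: seed d12 becomes a non-seed.
case1 = documents_to_cluster(clusters, {"d1", "d6", "d13", "d19"}, new_docs)
# Case 2: non-seed d14 becomes a seed.
case2 = documents_to_cluster(clusters, {"d1", "d6", "d12", "d14"}, new_docs)
```

In both cases only the cluster of d12 is falsified, so the clusters of d1 and d6 are left untouched.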
11
C2ICM Algorithm Details
We cluster these documents by assigning each one to the cluster of the seed that covers it most
Documents that do not belong to any cluster are grouped into a ragbag cluster
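The assignment step can be sketched as follows. This is an illustration under stated assumptions, not the original Pascal code: `coverage` is a hypothetical lookup holding the extent to which each seed covers each document, the function name is ours, and the numeric values below are invented purely to exercise the code.

```python
def assign_to_seeds(pending, seeds, coverage):
    """Assign each pending non-seed document to the seed that covers it most.

    coverage maps (document, seed) pairs to a non-negative number; missing
    pairs count as zero. Documents covered by no seed go to the ragbag.
    """
    assignment = {}
    ragbag = []
    for doc in sorted(pending):
        if doc in seeds:
            continue  # a seed starts its own cluster
        best = max(seeds, key=lambda s: coverage.get((doc, s), 0.0))
        if coverage.get((doc, best), 0.0) > 0.0:
            assignment[doc] = best
        else:
            ragbag.append(doc)
    return assignment, ragbag


# Invented coverage values, for illustration only.
coverage = {
    ("d12", "d13"): 0.4, ("d12", "d19"): 0.1,
    ("d14", "d13"): 0.2, ("d14", "d19"): 0.3,
}
assignment, ragbag = assign_to_seeds(
    pending={"d12", "d13", "d14", "d19", "d99"},
    seeds=["d13", "d19"],
    coverage=coverage,
)
```

Here d12 goes to the cluster of d13, d14 goes to the cluster of d19, and d99, which no seed covers, ends up in the ragbag.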
12
C2ICM: An example
Current state of the clusters
Seed list: d1, d6, d12
Cluster of seed d1: d1, d2, d3, d4, d5, d7
Cluster of seed d6: d6, d8, d9, d10, d11, d15
Cluster of seed d12: d12, d13, d14, d16, d17, d18
Ragbag cluster: d19
13
C2ICM: CASE 1
When a seed document becomes nonseed
Old seed list: d1, d6, d12
New seed list: d1, d6, d13, d19
Cluster of seed d1: d1, d2, d3, d4, d5, d7
Cluster of seed d6: d6, d8, d9, d10, d11, d15
Cluster of seed d12: d12, d13, d14, d16, d17, d18
The set of documents to be clustered (new documents arrived): d19, d20, d21, d22
14
C2ICM: CASE 1
Seed document d12 becomes nonseed
New seed list: d1, d6, d13, d19
Cluster of seed d1: d1, d2, d3, d4, d5, d7
Cluster of seed d6: d6, d8, d9, d10, d11, d15
The set of documents to be clustered: d12, d13, d14, d16, d17, d18, d19, d20, d21, d22
15
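C2ICM: CASE 1
Final clusters
New seed list: d1, d6, d13, d19
Cluster of seed d1: d1, d2, d3, d4, d5, d7
Cluster of seed d6: d6, d8, d9, d10, d11, d15
Cluster of seed d13: d12, d13, d16, d18, d20
Cluster of seed d19: d14, d17, d19, d21, d22
No elements remaining in the ragbag cluster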
16
C2ICM: CASE 2
When a nonseed document in a cluster becomes seed
Old seed list: d1, d6, d12
New seed list: d1, d6, d12, d14
Cluster of seed d1: d1, d2, d3, d4, d5, d7
Cluster of seed d6: d6, d8, d9, d10, d11, d15
Cluster of seed d12: d12, d13, d14, d16, d17, d18
The set of documents to be clustered (new documents arrived): d19, d20, d21, d22
17
C2ICM: CASE 2
Nonseed document d14 becomes seed.
New seed list: d1, d6, d12, d14
Cluster of seed d1: d1, d2, d3, d4, d5, d7
Cluster of seed d6: d6, d8, d9, d10, d11, d15
The set of documents to be clustered: d12, d13, d14 (becomes new seed), d16, d17, d18, d19, d20, d21, d22
18
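C2ICM: CASE 2
Final clusters
New seed list: d1, d6, d12, d14
Cluster of seed d1: d1, d2, d3, d4, d5, d7
Cluster of seed d6: d6, d8, d9, d10, d11, d15
Cluster of seed d12: d12, d13, d16, d18, d20, d22
Cluster of seed d14 (becomes new seed): d14, d17, d19, d21
No elements remaining in the ragbag cluster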
19
A Former Implementation of C2ICM for Very Large Datasets
C2ICM is implemented as two programs (VS Pascal)
Program I (Seed Selector) selects the seeds
Program II (C2ICM) clusters documents using the C2ICM algorithm
The programs communicate by exchanging text files
[Diagram: Program I (Seed Selector) and Program II (C2ICM), linked by text files; labels: documents, clusters]
20
Former Experiments
C2ICM was tested with a subset of the MARIAN database (~43K documents) in 1995.
Six experiments were done. Each incremental update added ~6K documents to an initially clustered collection of a different size
21
Results for the Former Experiments
• C2ICM provides time savings
• Clusters formed with C2ICM were very similar to the clusters formed with C3M
22
Conclusion
The cluster maintenance problem is challenging
Our aim is to conduct experiments for C2ICM with a very large number of documents (i.e. millions of documents)
The HARD dataset will be used for evaluation. Information retrieval performance will be measured.
Implementation of C2ICM must be time and memory efficient.
23
References
Can, F., Ozkarahan, E. A. "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases." ACM Transactions on Database Systems. Vol. 15, No. 4 (December, 1990), pp. 483-517.
Can, F. "Incremental clustering for dynamic information processing." ACM Transactions on Information Systems. Vol. 11, No. 2 (April, 1993), pp. 143-164.
Can, F., Fox, E. A., Snavely, C. D., France, R. K. "Incremental clustering for very large document databases: initial MARIAN experience." Information Sciences. Vol. 84 (1995), pp. 101-114.
Jain, A. K., Murty, M. N., Flynn, P. J. "Data clustering: a review." ACM Computing Surveys. Vol. 31, No. 3 (September, 1999), pp. 264-323.