Very Large-Scale Incremental Clustering. Berk Berker, Mumin Cebe, Ismet Zeki Yalniz. 27 March 2007.

1

Very Large-Scale Incremental Clustering

Berk Berker, Mumin Cebe, Ismet Zeki Yalniz

27 March 2007

2

Table of Contents

Why Clustering?

Why Incremental Clustering?

Related Work

Incremental C3M (C2ICM)

A Former Implementation of C2ICM for Very Large Datasets

Conclusion

3

Why Clustering?

It is an effective tool to manage information overload

To browse large document collections quickly

To easily grasp the distinct topics and subtopics (concept hierarchies)

To allow search engines to efficiently query large document collections

4

Types of Clustering

Hierarchical vs. Non-hierarchical

Partitional vs. Agglomerative

Deterministic vs. Probabilistic algorithms

Incremental vs. Batch algorithms

5

Why Incremental Clustering?

The current information explosion

Popular sources of informational text documents such as Newswire and Blogs

Delays in clustering newly arriving documents would be unacceptable in several important areas

6

Related Work

The cluster-splitting approach

Adaptive clustering based on user queries

Cobweb algorithm

Hierarchical clustering in an incremental manner

7

C2ICM Algorithm

C3M is known as an efficient, effective and robust algorithm for clustering documents

C3M is well developed for initial clustering, but clusters must also be maintained as the collection changes

8

C2ICM Algorithm

Like C3M, the C2ICM algorithm is based on the cover coefficient concept.

C2ICM is suitable for dynamic environments in which documents are added and deleted.

With C2ICM, reclustering the whole collection after each update is avoided.
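The cover coefficient concept can be illustrated with a small sketch. It assumes the definition given in Can and Ozkarahan (1990), where c_ij = alpha_i * sum_k(d_ik * beta_k * d_jk), alpha_i is the reciprocal of the i-th row sum of the document-term matrix D, and beta_k is the reciprocal of the k-th column sum. The function name `cover_coefficients` and the toy matrix are illustrative only and do not come from the original implementation.

```python
from typing import List


def cover_coefficients(D: List[List[float]]) -> List[List[float]]:
    """Return the m x m cover coefficient matrix for an m x n document-term matrix D.

    c[i][j] is the extent to which document i is covered by document j.
    Assumption: every document contains at least one term and every term
    occurs in at least one document (no zero row or column sums).
    """
    m, n = len(D), len(D[0])
    alpha = [1.0 / sum(row) for row in D]                            # 1 / row sums
    beta = [1.0 / sum(D[i][k] for i in range(m)) for k in range(n)]  # 1 / column sums
    return [
        [alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(n))
         for j in range(m)]
        for i in range(m)
    ]


# Toy binary matrix: 3 documents, 3 terms (hypothetical data).
D = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1]]
C = cover_coefficients(D)
# Each row of C sums to 1; the diagonal entry c[i][i] is the
# decoupling coefficient of document i.
```

Under this definition a document that shares no terms with any other document has c[i][i] = 1, which is why the diagonal is read as a measure of how distinct each document is.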

9

C2ICM Algorithm Details

First we compute the number of clusters and cluster seed powers in the updated database

Then we determine the documents to be clustered: the newly added documents and the members of falsified clusters
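The first step can be sketched as follows. This assumes the C3M definitions from Can and Ozkarahan (1990) for a binary document-term matrix: the decoupling coefficient delta_i is the diagonal cover coefficient c_ii, the number of clusters is the sum of the delta_i, and the seed power of document i is delta_i * (1 - delta_i) * (number of terms in document i). Rounding the number of clusters to the nearest integer and ignoring ties between equal seed powers are simplifications made for this sketch, and `select_seeds` is a hypothetical name.

```python
from typing import List, Tuple


def select_seeds(D: List[List[float]]) -> Tuple[float, List[float], List[int]]:
    """Return (n_c, seed_powers, seed_indices) for a binary document-term matrix D."""
    m, n = len(D), len(D[0])
    col_sums = [sum(D[i][k] for i in range(m)) for k in range(n)]
    # Decoupling coefficient delta_i = c_ii (assumes no zero row or column sums).
    delta = [
        sum(D[i][k] ** 2 / col_sums[k] for k in range(n)) / sum(D[i])
        for i in range(m)
    ]
    n_c = sum(delta)                      # estimated number of clusters
    powers = [delta[i] * (1.0 - delta[i]) * sum(D[i]) for i in range(m)]
    n_seeds = max(1, round(n_c))          # assumption: round to nearest integer
    ranked = sorted(range(m), key=lambda i: powers[i], reverse=True)
    return n_c, powers, sorted(ranked[:n_seeds])


# Hypothetical toy data: 3 documents, 3 terms.
D = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1]]
n_c, powers, seeds = select_seeds(D)
```

Running the same selection on the updated matrix after each batch of additions and deletions yields the new seed list that is compared with the old one.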

10

C2ICM Algorithm Details

How do clusters become false?

When a seed document becomes a non-seed or is deleted

When one or more non-seed documents of the cluster become seeds
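The two falsification rules above can be sketched as a small function. The data layout (a dict from seed to the set of cluster members) and the name `documents_to_recluster` are assumptions made for illustration; the document identifiers are taken from the example slides.

```python
from typing import Dict, FrozenSet, Set, Tuple


def documents_to_recluster(
    clusters: Dict[str, Set[str]],      # old seed -> members (seed included)
    new_seeds: Set[str],                # seeds selected in the updated database
    new_docs: Set[str],                 # newly added documents
    deleted: FrozenSet[str] = frozenset(),
) -> Tuple[Dict[str, Set[str]], Set[str]]:
    """Return (clusters that stay valid, documents that must be clustered)."""
    pending = set(new_docs)
    kept: Dict[str, Set[str]] = {}
    for seed, members in clusters.items():
        alive = members - deleted
        # Rule 1: the seed was deleted or is no longer a seed.
        seed_lost = seed in deleted or seed not in new_seeds
        # Rule 2: some non-seed member of the cluster has become a seed.
        promoted = any(d in new_seeds for d in alive if d != seed)
        if seed_lost or promoted:
            pending |= alive            # falsified cluster: recluster its members
        else:
            kept[seed] = alive
    return kept, pending


clusters = {
    "d1": {"d1", "d2", "d3", "d4", "d5", "d7"},
    "d6": {"d6", "d8", "d9", "d10", "d11", "d15"},
    "d12": {"d12", "d13", "d14", "d16", "d17", "d18"},
}
new_docs = {"d19", "d20", "d21", "d22"}
# Seed d12 loses its seed status, so its cluster is falsified.
kept, pending = documents_to_recluster(clusters, {"d1", "d6", "d13", "d19"}, new_docs)
```

Clusters whose seed survives and whose members gain no seed status are left untouched, which is where the savings over full reclustering come from.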

11

C2ICM Algorithm Details

We cluster these documents by assigning them to the cluster of the seed that covers them most

Documents that do not belong to any cluster are grouped into a ragbag cluster
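A minimal sketch of this assignment step, assuming the cover coefficients between the documents to be clustered and the seeds are already available as a lookup table. The function name `assign_to_seeds`, the coverage values, and the document `dX` are hypothetical; ties are broken in favour of the seed listed first, which is a simplification.

```python
from typing import Dict, List, Tuple


def assign_to_seeds(
    docs: List[str],
    seeds: List[str],
    cover: Dict[Tuple[str, str], float],   # (document, seed) -> cover coefficient
) -> Tuple[Dict[str, str], List[str]]:
    """Return (document -> chosen seed, ragbag documents)."""
    assignment: Dict[str, str] = {}
    ragbag: List[str] = []
    for doc in docs:
        if doc in seeds:
            assignment[doc] = doc          # a seed heads its own cluster
            continue
        # Pick the seed that covers this document most.
        best = max(seeds, key=lambda s: cover.get((doc, s), 0.0))
        if cover.get((doc, best), 0.0) > 0.0:
            assignment[doc] = best
        else:
            ragbag.append(doc)             # not covered by any seed
    return assignment, ragbag


# Illustrative coverage values only; they are not taken from the experiments.
cover = {
    ("d12", "d13"): 0.4, ("d12", "d19"): 0.1,
    ("d14", "d13"): 0.2, ("d14", "d19"): 0.3,
}
assignment, ragbag = assign_to_seeds(["d12", "d13", "d14", "dX"], ["d13", "d19"], cover)
```

Here `dX` shares no coverage with either seed, so it is the only document that ends up in the ragbag.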

12

C2ICM: An example

Current state of the clusters

Cluster with seed d1: d1, d2, d3, d4, d5, d7

Cluster with seed d6: d6, d8, d9, d10, d11, d15

Cluster with seed d12: d12, d13, d14, d16, d17, d18

Ragbag cluster: d19

Seed list: d1, d6, d12

13

C2ICM: CASE 1

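When a seed document becomes a nonseed

Old seed list: d1, d6, d12

New documents arrived (added to the set of documents to be clustered): d19, d20, d21, d22

New seed list: d1, d6, d13, d19

Cluster with seed d1: d1, d2, d3, d4, d5, d7

Cluster with seed d6: d6, d8, d9, d10, d11, d15

Cluster with seed d12: d12, d13, d14, d16, d17, d18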

14

C2ICM: CASE 1

Seed document d12 becomes a nonseed

Cluster with seed d1: d1, d2, d3, d4, d5, d7

Cluster with seed d6: d6, d8, d9, d10, d11, d15

The set of documents to be clustered: d12, d13, d14, d16, d17, d18, d19, d20, d21, d22

New seed list: d1, d6, d13, d19

15

C2ICM: CASE 1

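Final clusters

Cluster with seed d1: d1, d2, d3, d4, d5, d7

Cluster with seed d6: d6, d8, d9, d10, d11, d15

Cluster with seed d13: d13, d12, d16, d18, d20

Cluster with seed d19: d19, d14, d17, d21, d22

New seed list: d1, d6, d13, d19

No elements remaining in the ragbag cluster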

16

C2ICM: CASE 2

When a nonseed document in a cluster becomes a seed

Old seed list: d1, d6, d12

New documents arrived (added to the set of documents to be clustered): d19, d20, d21, d22

New seed list: d1, d6, d12, d14

Cluster with seed d1: d1, d2, d3, d4, d5, d7

Cluster with seed d6: d6, d8, d9, d10, d11, d15

Cluster with seed d12: d12, d13, d14, d16, d17, d18

17

C2ICM: CASE 2

Nonseed document d14 becomes a seed.

Cluster with seed d1: d1, d2, d3, d4, d5, d7

Cluster with seed d6: d6, d8, d9, d10, d11, d15

The set of documents to be clustered: d12, d13, d14 (becomes new seed), d16, d17, d18, d19, d20, d21, d22

New seed list: d1, d6, d12, d14

18

C2ICM: CASE 2

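Final clusters

Cluster with seed d1: d1, d2, d3, d4, d5, d7

Cluster with seed d6: d6, d8, d9, d10, d11, d15

Cluster with seed d12: d12, d13, d16, d18, d20, d22

Cluster with seed d14 (new seed): d14, d17, d19, d21

New seed list: d1, d6, d12, d14

No elements remaining in the ragbag cluster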

19

A Former Implementation of C2ICM for Very Large Datasets

C2ICM is implemented as two programs (VS Pascal)

Program I (Seed Selector) selects the seeds

Program II (C2ICM) clusters documents using the C2ICM algorithm

These programs communicate by exchanging text files

Diagram: Program I (Seed Selector) and Program II (C2ICM) exchange text files; the input is documents and the output is clusters

20

Former Experiments

C2ICM was tested in 1995 with a subset of the MARIAN database (~43K documents).

Six experiments were conducted. In each, an incremental update added ~6K documents to an initially clustered collection of a different size.

21

Results for the Former Experiments

• C2ICM provides time savings

• Clusters formed with C2ICM were very similar to the clusters formed with C3M

22

Conclusion

The cluster maintenance problem is challenging

Our aim is to conduct experiments for C2ICM with a very large number of documents (i.e. millions of documents)

The HARD dataset will be used for evaluation. Information retrieval performance will be measured.

The implementation of C2ICM must be time and memory efficient.

23

References

Can, F., Ozkarahan, E. A.  "Concepts and effectiveness of the cover coefficient-based clustering methodology for text databases."  ACM Transactions on Database Systems.  Vol. 15, No. 4 (December, 1990), pp. 483-517.

Can, F.  "Incremental clustering for dynamic information processing."  ACM Transactions on Information Systems.  Vol. 11, No. 2 (April, 1993), pp. 143-164.

Can, F., Fox, E. A., Snavely, C. D., France, R. K.  "Incremental clustering for very large document databases: initial MARIAN experience."  Information Sciences.  Vol. 84 (1995), pp. 101-114.

Jain, A. K., Murty, M. N., Flynn, P. J.  "Data clustering: a review."  ACM Computing Surveys.  Vol. 31, No. 3 (September, 1999), pp. 264-323.

24

Questions?