A matrix density based algorithm to hierarchically co-cluster documents and words

Post on 03-Jan-2016

Advisor: Dr. Hsu
Graduate: Keng-Wei Chang
Authors: Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru

Outline

Motivation
Objective
Introduction
Background
Rowset Partitioning and Submatrix Agglomeration (RPSA)
Experimental results
Conclusions
Personal Opinion

Motivation

With the explosion of unstructured information, it has become increasingly important to organize that information in a comprehensible and navigable manner.

Objective

A hierarchical arrangement of documents is very useful for browsing a document collection, as the popularity of Yahoo and Google shows.

This paper proposes an algorithm that hierarchically co-clusters documents and words.

Introduction

1990s: 100 thousand pages; 2002: 2 billion pages

It has become increasingly important to organize the information

Manual organization is accurate, but not always feasible

Tools are needed to automatically arrange documents into labeled hierarchies

Proposed: RPSA, a two-step partitional-agglomerative algorithm

Background

Vector Model for Documents
Evaluation of Clustering Quality
Evaluation of Hierarchical Clustering

Vector Model for Documents

We have $d$ documents

Document $i$ is represented by the vector $m_i$

$t_{ij}$ is the number of occurrences of word $j$ in document $i$

Term Frequency (TF)

Inverse Document Frequency (IDF)

Unitized TF-IDF
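The formulas on this slide did not survive in the transcript, so the following is only a minimal sketch of one common formulation, not necessarily the paper's: TF is taken as the raw count $t_{ij}$, IDF as $\log(d / \mathit{df}_j)$, where $\mathit{df}_j$ (a name assumed here) is the number of documents containing word $j$, and each weighted document vector is scaled to unit Euclidean length. The function name `unitized_tfidf` is hypothetical.

```python
import math

def unitized_tfidf(counts):
    """counts[i][j] = occurrences of word j in document i (t_ij)."""
    d = len(counts)
    n_words = len(counts[0]) if counts else 0
    # Document frequency of word j: number of documents with t_ij > 0.
    doc_freq = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    # Assumed IDF form: log(d / df_j); words that occur nowhere get weight 0.
    idf = [math.log(d / df) if df else 0.0 for df in doc_freq]
    vectors = []
    for row in counts:
        weighted = [t * w for t, w in zip(row, idf)]
        norm = math.sqrt(sum(x * x for x in weighted))
        # Scale to unit length; an all-zero vector is left unchanged.
        vectors.append([x / norm for x in weighted] if norm else weighted)
    return vectors

counts = [[2, 0, 1], [0, 3, 1], [1, 1, 0]]
vectors = unitized_tfidf(counts)
```

Under this assumed IDF, a word that occurs in every document gets weight zero, so it cannot influence the clustering.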

Evaluation of Clustering Quality

1. Purity

2. Entropy: $E_j = -\sum_{i=1}^{g} p_{ij} \log p_{ij}$
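The two measures can be sketched for a single cluster $j$ as follows. This assumes the usual reading of the symbols, which the transcript does not spell out: $p_{ij}$ is the fraction of documents in cluster $j$ that belong to class $i$, $g$ is the number of classes, purity is the largest $p_{ij}$, and the logarithm is natural. The function name is hypothetical, and the cluster is assumed to be non-empty.

```python
import math
from collections import Counter

def cluster_purity_entropy(class_labels):
    """class_labels: the true class of each document in one cluster j."""
    n = len(class_labels)
    # p_ij: fraction of the cluster's documents that belong to class i.
    probs = [count / n for count in Counter(class_labels).values()]
    purity = max(probs)
    # E_j = -sum_i p_ij log p_ij; classes absent from the cluster contribute 0.
    entropy = -sum(p * math.log(p) for p in probs)
    return purity, entropy

purity, entropy = cluster_purity_entropy(
    ["sports", "sports", "sports", "politics"]
)
```

A cluster drawn from a single class has purity 1 and entropy 0; the more evenly the classes are mixed, the lower the purity and the higher the entropy.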

Evaluation of Hierarchical Clustering

Rowset Partitioning and Submatrix Agglomeration(RPSA)

Two-step partitional-agglomerative algorithm
1st step: The Partitioning Step
2nd step: The Agglomerative Step

The Partitioning Step

Define the density of submatrices

a row r, a column c

a set R of rows , a set C of columns
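The density formulas themselves are missing from the transcript. As a rough sketch, assume the density of the submatrix selected by a row set R and a column set C is the mean of its entries, with a single row or column as the special case of a one-element set. The function name `density` and the sample matrix are illustrative only.

```python
def density(matrix, rows, cols):
    """Mean of matrix[r][c] over r in rows and c in cols (assumed definition)."""
    rows, cols = list(rows), list(cols)
    total = sum(matrix[r][c] for r in rows for c in cols)
    return total / (len(rows) * len(cols))

M = [
    [1.0, 0.0, 2.0],
    [0.0, 0.0, 1.0],
    [3.0, 1.0, 0.0],
]
whole = density(M, range(3), range(3))  # density of the full matrix
row_0 = density(M, [0], range(3))       # a single row r over all columns
block = density(M, [0, 2], [0, 2])      # rows R = {0, 2}, columns C = {0, 2}
```

Here `block` is denser than `whole`, which is the kind of submatrix a density-based co-clustering would want to isolate.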

Generating a Leaf Cluster

Choice of Leader Documents

The length of a document is the sum of the entries of the TF-IDF vector representing that document

Documents with relatively large lengths were observed to be better leader documents for the algorithm above
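One simple way to act on this observation is to rank documents by length and try the longest ones first as leaders. This is a sketch under that assumption, taking a document's length to be the sum of its TF-IDF entries; the slides do not say exactly how the algorithm picks among candidates, and `leader_candidates` is a hypothetical name.

```python
def leader_candidates(tfidf_rows):
    """Return document indices sorted by decreasing length.

    Length is taken to be the sum of the document's TF-IDF entries.
    """
    lengths = [sum(row) for row in tfidf_rows]
    return sorted(range(len(tfidf_rows)), key=lambda i: lengths[i], reverse=True)

rows = [
    [0.1, 0.0, 0.2],  # length about 0.3
    [0.5, 0.4, 0.3],  # length about 1.2
    [0.0, 0.6, 0.0],  # length 0.6
]
order = leader_candidates(rows)
```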

The Complete Partitioning Algorithm

Complexity Analysis
The time complexity is O(mz)
The space complexity is O(z)

The Agglomerative Step

Reduce the number of clusters
The similarity measure between two clusters for merging
Flat Clustering
Hierarchical Clustering
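The similarity measure used for merging is not preserved in the transcript. The sketch below substitutes cosine similarity between cluster centroids purely as a stand-in, and merges the most similar pair until `k` clusters remain (the flat case); the recorded merge history is what a hierarchy would be built from. All names are hypothetical, and this naive loop makes no attempt to match the complexity stated in the slides.

```python
import math

def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def agglomerate(clusters, k):
    """Merge the most similar pair of clusters until only k remain.

    clusters: list of clusters, each a list of document vectors.
    Returns the final clusters and the merge history, as index pairs
    into the cluster list as it stood at the time of each merge.
    """
    clusters = [list(c) for c in clusters]
    history = []
    while len(clusters) > max(k, 1):
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Stand-in similarity: cosine between cluster centroids.
                sim = cosine(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        _, i, j = best
        history.append((i, j))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters, history

leaf_clusters = [
    [[1.0, 0.0]],
    [[0.9, 0.1]],
    [[0.0, 1.0]],
    [[0.1, 0.9]],
]
merged, history = agglomerate(leaf_clusters, 2)
```

In this toy input the two documents close to each axis end up together, leaving two clusters of two documents each.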

Complexity Analysis
The time complexity is $O(zm^2)$
The space complexity is $O(m^2)$

Experimental results-Flat Clustering

Data Sets

Results

Experimental results-Hierarchical Clustering

Data Sets

Results

Conclusions

RPSA is comparable with or better than the best k-means run

Its performance does not degrade on small data sets

Its purity on hierarchical clustering is acceptable

Personal Opinion