A matrix density based algorithm to hierarchically co-cluster documents and words

Advisor: Dr. Hsu
Graduate: Keng-Wei Chang
Authors: Bhushan Mandhani, Sachindra Joshi, Krishna Kummamuru

Outline

Motivation
Objective
Introduction
Background
Rowset Partitioning and Submatrix Agglomeration (RPSA)
Experimental results
Conclusions
Personal Opinion

Motivation

With the explosion of unstructured information, it has become increasingly important to organize that information in a comprehensible and navigable manner.

Objective

A hierarchical arrangement of documents is very useful for browsing a document collection, as the popularity of Yahoo and Google shows.

This paper proposes an algorithm to hierarchically co-cluster documents and words.

Introduction

1990s: 100 thousand pages; 2002: 2 billion pages
It has become increasingly important to organize the information
Manual organization is accurate, but not always feasible
Need tools to automatically arrange documents into labeled hierarchies
Propose RPSA: a two-step partitional-agglomerative algorithm

Background

Vector Model for Documents
Evaluation of Clustering Quality
Evaluation of Hierarchical Clustering

Vector Model for Documents

We have d documents

Document i is represented by a vector whose j-th entry is the number of occurrences of word j in document i

Term Frequency (TF)

Inverse Document Frequency (IDF)

Unitized TF-IDF
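The formulas on this slide did not survive, so the following is only a minimal sketch of one common way to build unit-length TF-IDF vectors from raw word counts. The function name `tfidf_unit_vectors`, the IDF variant log(d / df_j), and the use of the Euclidean norm for "unitizing" are all assumptions, not necessarily what the paper uses.

```python
import math

def tfidf_unit_vectors(counts):
    """counts[i][j] = number of occurrences of word j in document i."""
    d = len(counts)
    n_words = len(counts[0]) if d else 0
    # Document frequency: how many documents contain each word.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_words)]
    # Assumed IDF variant: log(d / df_j); words that never occur get weight 0.
    idf = [math.log(d / df_j) if df_j else 0.0 for df_j in df]
    vectors = []
    for row in counts:
        weighted = [tf * w for tf, w in zip(row, idf)]
        # Assumed normalization: divide by the Euclidean norm.
        norm = math.sqrt(sum(x * x for x in weighted))
        vectors.append([x / norm for x in weighted] if norm else weighted)
    return vectors

# Three documents over three words; word 2 occurs in every document,
# so its IDF weight (and every entry in its column) is zero.
example = tfidf_unit_vectors([[2, 0, 1], [0, 3, 1], [1, 1, 1]])
```

With this variant, a word that appears in every document contributes nothing, which is why the third column of `example` is all zeros.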

Evaluation of Clustering Quality

1. Purity:

2. Entropy: $E_j = -\sum_i p_{ij} \log p_{ij}$
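As an illustration of the two measures, here is a hedged sketch using their usual textbook definitions: for one cluster, take the fraction of its documents that belong to each true class; purity is the largest fraction and entropy is the negative sum of fraction times log fraction. The function name and the choice of natural logarithm are assumptions, and the paper may normalize differently.

```python
import math
from collections import Counter

def purity_and_entropy(cluster_labels):
    """cluster_labels: true class labels of the documents in one non-empty cluster."""
    n = len(cluster_labels)
    # Fraction of the cluster's documents that fall in each class.
    fractions = [count / n for count in Counter(cluster_labels).values()]
    purity = max(fractions)
    entropy = -sum(p * math.log(p) for p in fractions)
    return purity, entropy

mixed = purity_and_entropy(["sports", "sports", "politics", "politics"])
pure = purity_and_entropy(["sports", "sports", "sports"])
```

A perfectly pure cluster has purity 1 and entropy 0; an even two-class mix has purity 0.5 and entropy log 2.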

Evaluation of Hierarchical Clustering

Rowset Partitioning and Submatrix Agglomeration (RPSA)

Two-step partitional-agglomerative algorithm
1st step: The Partitioning Step
2nd step: The Agglomerative Step

The Partitioning Step

Define the density of submatrices

a row r, a column c

a set R of rows, a set C of columns
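The density formula itself is missing from the slide. The sketch below assumes that the density of the submatrix defined by a row set R and a column set C is the mean of its entries, with the density of a single row r (over C) or a single column c (over R) as special cases. The function names are hypothetical.

```python
def density(matrix, rows, cols):
    """Mean entry of the submatrix selected by rows x cols (assumed definition)."""
    rows, cols = list(rows), list(cols)
    if not rows or not cols:
        return 0.0
    total = sum(matrix[r][c] for r in rows for c in cols)
    return total / (len(rows) * len(cols))

def row_density(matrix, r, cols):
    """Density of row r restricted to the column set cols."""
    return density(matrix, [r], cols)

def column_density(matrix, c, rows):
    """Density of column c restricted to the row set rows."""
    return density(matrix, rows, [c])

M = [[1, 0, 1],
     [1, 1, 0],
     [0, 0, 0]]
d_sub = density(M, [0, 1], [0, 1])      # (1 + 0 + 1 + 1) / 4
d_row = row_density(M, 0, [0, 2])       # (1 + 1) / 2
d_col = column_density(M, 1, [0, 1, 2]) # (0 + 1 + 0) / 3
```

Under this reading, a dense submatrix is a group of documents (rows) that share heavy use of a group of words (columns).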

Generating a Leaf Cluster

Choice of Leader Documents

Document length: the sum of the entries of the TF-IDF vector representing that document

Documents with relatively large lengths were observed to be better leader documents for the algorithm above
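One simple way to act on this observation is sketched below: among the documents not yet assigned to a cluster, pick the one whose TF-IDF vector has the largest entry sum. Always taking the maximum is an assumption made for illustration; the slide only says that relatively long documents worked better. The function name `choose_leader` is hypothetical.

```python
def choose_leader(vectors, unassigned):
    """Return the index in `unassigned` whose vector has the largest entry sum.

    Ties are broken in favor of the smallest index, so the result is deterministic.
    """
    return max(sorted(unassigned), key=lambda i: sum(vectors[i]))

vectors = [[0.1, 0.2], [0.5, 0.5], [0.9, 0.0]]
first = choose_leader(vectors, {0, 1, 2})  # document 1 has the largest sum
second = choose_leader(vectors, {0, 2})    # document 2 once document 1 is taken
```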

The Complete Partitioning Algorithm

Complexity Analysis
The time complexity is O(mz)
The space complexity is O(z)

The Agglomerative Step

Reduce the number of clusters
The similarity measure between two clusters for merging
Flat Clustering
Hierarchical Clustering
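The slide does not give the similarity measure, so the sketch below shows only the general shape of an agglomerative step: repeatedly merge the most similar pair of clusters until a target count remains. Cosine similarity between cluster centroids is used purely as a stand-in and is not the paper's measure; all function names are hypothetical.

```python
import math

def centroid(vectors, members):
    """Component-wise mean of the vectors of the documents in `members`."""
    dim = len(vectors[0])
    return [sum(vectors[i][k] for i in members) / len(members) for k in range(dim)]

def cosine(u, v):
    """Cosine similarity; defined as 0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def agglomerate(vectors, clusters, target):
    """Merge the most similar pair of clusters until `target` clusters remain.

    Returns the final clusters and the list of merges in the order performed.
    """
    clusters = [list(c) for c in clusters]
    merges = []
    while len(clusters) > max(target, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                s = cosine(centroid(vectors, clusters[a]),
                           centroid(vectors, clusters[b]))
                if best is None or s > best[0]:
                    best = (s, a, b)
        _, a, b = best
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)] + [merged]
    return clusters, merges

docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
flat, _ = agglomerate(docs, [[0], [1], [2], [3]], target=2)
_, tree = agglomerate(docs, [[0], [1], [2], [3]], target=1)
```

Stopping at a target count gives a flat clustering, while running down to a single cluster and keeping the merge list gives a hierarchy.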

The Agglomerative Step

Complexity Analysis
The time complexity is O( )
The space complexity is O( )

Experimental results-Flat Clustering

Data Sets

Results

Experimental results-Hierarchical Clustering

Data Sets

Results

Conclusions

RPSA is comparable to or better than the best k-means run

Its performance does not degrade on small data sets

Its purity in the hierarchy is acceptable

Personal Opinion
