
Clustering for web documents

박흠


Contents

Cluto

Criterion Functions for Document Clustering: Experiments and Analysis (2002)
  by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455

Feature selection for web documents (2004)


Cluto: Clustering Toolkit 2.1.1

Department of Computer Science, University of Minnesota, Minneapolis
http://www-users.cs.umn.edu/~karypis/

Platforms: Linux 2.4.18, Sun OS 5.7, Win32

Programs: CLUTO's user-callable library, vcluster, scluster


Cluto: What is Cluto? (1/2)

Clustering algorithms
  partitional clustering
  agglomerative clustering
  graph-partitioning clustering

Clustering criterion functions
  provides seven different criterion functions that can drive both partitional and agglomerative clustering algorithms
  also provides some of the more traditional local criteria (e.g., single-link, complete-link, and UPGMA) for agglomerative clustering


Cluto: What is Cluto? (2/2)

Analyzes the discovered clusters
  relations between the objects assigned to each cluster
  relations between the different clusters
  identifies the features that best describe and/or discriminate each cluster
  relationships between the clusters, objects, and features

Operates on very large datasets, in terms of both
  the number of objects
  the number of dimensions


Cluto: Programs

vcluster: operates in the objects' feature space
scluster: operates in the objects' similarity space

Interface
  vcluster [optional parameters] MatrixFile Ncluster
  MatrixFile: an n × m matrix whose rows correspond to objects and whose columns correspond to features
  Ncluster: the number of clusters


Cluto: Parameters of Algorithms (-clmethod)

rb, rbr: k-1 repeated bisections (rbr additionally optimizes the criterion function over the whole solution at the end)
direct: computed by simultaneously finding all k clusters
agglo: the agglomerative paradigm
graph: partitions a nearest-neighbor graph
bagglo: biased agglomerative clustering


Cluto: Parameters of the similarity function

cos: the cosine function (default)
corr: the correlation coefficient
dist: the Euclidean distance; applicable only when -clmethod=graph
jacc: the extended Jaccard coefficient; applicable only when -clmethod=graph


Cluto: Parameters of the criterion function

i1, i2, e1, g1, g1p, h1, h2


Cluto: Parameters of the criterion function (agglomerative clustering)

slink: single link
wslink: weighted single link
clink: complete link
wclink: weighted complete link
upgma: UPGMA

Other parameters: cstype, fulltree, rowmodel, colmodel, showfeatures
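
The option values listed on the parameter slides can be collected into a small helper that assembles a vcluster command line. This is a minimal sketch and not part of CLUTO: the function name `build_vcluster_args`, its default values, and the file name `docs.mat` are hypothetical, and the `-sim` and `-crfun` option names are assumed to follow the same `-name=value` pattern as the `-clmethod=graph` option quoted on the slides. The rule that the link-based criteria go only with `agglo` is also an assumption drawn from the slides.

```python
# Hypothetical helper: assemble an argument list for CLUTO's vcluster program.
CL_METHODS = {"rb", "rbr", "direct", "agglo", "graph", "bagglo"}
SIM_FUNCTIONS = {"cos", "corr", "dist", "jacc"}
GRAPH_ONLY_SIMS = {"dist", "jacc"}  # per the slides: only with -clmethod=graph
GLOBAL_CRITERIA = {"i1", "i2", "e1", "g1", "g1p", "h1", "h2"}
LINK_CRITERIA = {"slink", "wslink", "clink", "wclink", "upgma"}


def build_vcluster_args(matrix_file, n_clusters, clmethod="rb", sim="cos", crfun="i2"):
    """Return an argv list for: vcluster [optional parameters] MatrixFile Ncluster."""
    if clmethod not in CL_METHODS:
        raise ValueError(f"unknown clustering method: {clmethod}")
    if sim not in SIM_FUNCTIONS:
        raise ValueError(f"unknown similarity function: {sim}")
    if sim in GRAPH_ONLY_SIMS and clmethod != "graph":
        raise ValueError(f"-sim={sim} requires -clmethod=graph")
    if crfun not in GLOBAL_CRITERIA | LINK_CRITERIA:
        raise ValueError(f"unknown criterion function: {crfun}")
    # Assumption: the link-based criteria are accepted only for agglomerative runs.
    if crfun in LINK_CRITERIA and clmethod != "agglo":
        raise ValueError(f"-crfun={crfun} requires -clmethod=agglo")
    if n_clusters < 1:
        raise ValueError("the number of clusters must be positive")
    return [
        "vcluster",
        f"-clmethod={clmethod}",
        f"-sim={sim}",
        f"-crfun={crfun}",
        matrix_file,
        str(n_clusters),
    ]


if __name__ == "__main__":
    # Print the command instead of running it, since CLUTO may not be installed.
    print(" ".join(build_vcluster_args("docs.mat", 20, clmethod="rbr", crfun="h2")))
```

The returned list can be handed to `subprocess.run` once the vcluster binary is on the path; validating the combinations first keeps invalid pairings such as `-sim=jacc` with `-clmethod=rb` from reaching the tool.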


Criterion Functions for Document Clustering: Experiments and Analysis (2002)

by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455


Data Clustering

A.K. Jain, Michigan State University
M.N. Murty, Indian Institute of Science
and P.J. Flynn, The Ohio State University

ACM Computing Surveys


Introduction (1/2): Clustering algorithms

Agglomerative algorithms
  UPGMA, single-link, complete-link, CURE, ROCK, Chameleon

Partitional algorithms
  K-means, K-medoids, Autoclass, graph-partitioning-based, spectral-partitioning-based
  well suited to large datasets because they are fast

Seven criterion functions
  measure intra-cluster similarity, inter-cluster similarity, or a combination of the two
  i1, i2, e1, g1, g1p, h1, h2


Introduction (2/2): Datasets

15 different data sets


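Preliminaries (1/3): Document Representation

Each document is represented in the vector space model
  d: document; tf: term frequency; tf_i: frequency of the i-th term in the document
  $d_{tf} = (tf_1, tf_2, \ldots, tf_m)$
  terms are weighted by idf, giving tf*idf; N: total number of documents

Similarity Measures: the similarity between two documents d_i and d_j
  Cosine function: $\cos(d_i, d_j) = \frac{d_i^t d_j}{\|d_i\| \, \|d_j\|}$, where $\|d\|$ is the length of the document vector (documents are normalized to unit length)
  1: identical; 0: nothing in common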
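
The representation and the cosine measure can be tried on a toy corpus. The sketch below assumes the common weighting tf_i · log(N / df_i), where df_i (the number of documents containing term i) is a symbol not defined on the slide; the function names and the sample documents are hypothetical.

```python
import math


def tfidf_vectors(term_counts, vocabulary):
    """Turn per-document term counts into tf*idf vectors over a fixed vocabulary.

    term_counts: list of dicts mapping term -> raw frequency in that document.
    Assumed weighting: tf_i * log(N / df_i).
    """
    n_docs = len(term_counts)
    # df[t] = number of documents that contain term t at least once.
    df = {t: sum(1 for doc in term_counts if doc.get(t, 0) > 0) for t in vocabulary}
    vectors = []
    for doc in term_counts:
        vec = [
            doc.get(t, 0) * math.log(n_docs / df[t]) if df[t] else 0.0
            for t in vocabulary
        ]
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    docs = [{"star": 2, "orbit": 1}, {"star": 1, "orbit": 2}, {"cell": 3}]
    vecs = tfidf_vectors(docs, ["star", "orbit", "cell"])
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Documents that share no terms get a cosine of 0, and a document compared with itself gets 1, matching the two endpoints named on the slide.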


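Preliminaries (2/3): Euclidean distance function

$dis(d_i, d_j) = \|d_i - d_j\|$
  if dis = 0, the documents are identical; if dis = $\sqrt{2}$, they have nothing in common (for unit-length document vectors)

Definitions
  S: the set of documents
  S_1, S_2, …, S_k: the sets of documents in each of the k clusters; k: the number of clusters
  n_1, n_2, …, n_k: the sizes (numbers of documents) of the corresponding clusters
  A: a set of documents
  composite vector D_A: the sum of all document vectors in A, $D_A = \sum_{d \in A} d$
  centroid vector C_A: the average of the term weights of the documents in A, $C_A = D_A / |A|$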
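
A short sketch of these definitions, using plain Python lists as document vectors (the helper names are hypothetical). It also shows why the largest distance between two unit-length documents with no shared terms is √2.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def composite(docs):
    """Composite vector D_A: component-wise sum of the document vectors in A."""
    return [sum(column) for column in zip(*docs)]


def centroid(docs):
    """Centroid vector C_A: the composite vector divided by the number of documents."""
    return [x / len(docs) for x in composite(docs)]


def euclidean(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


if __name__ == "__main__":
    d1 = normalize([3.0, 0.0])  # uses only the first term
    d2 = normalize([0.0, 5.0])  # uses only the second term
    print(composite([d1, d2]), centroid([d1, d2]))
    print(euclidean(d1, d1), euclidean(d1, d2))
```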


Preliminaries (3/3): Vector Properties

S_i, S_j: two sets of unit-length documents containing n_i and n_j documents
D_i, D_j: the composite vectors; C_i, C_j: the centroid vectors

The sum of the pairwise similarities between the documents in S_i and the documents in S_j is $D_i^t D_j$
The sum of the pairwise similarities between the documents in S_i is $\|D_i\|^2$
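
Both properties follow from the linearity of the dot product, and they can be checked numerically. The sketch below assumes unit-length document vectors, so that the cosine equals the dot product; the helper names and sample vectors are hypothetical.

```python
import math


def normalize(vec):
    """Scale a non-zero vector to unit length."""
    length = math.sqrt(sum(x * x for x in vec))
    return [x / length for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composite(docs):
    """Component-wise sum of the document vectors."""
    return [sum(column) for column in zip(*docs)]


def pairwise_similarity_sum(set_i, set_j):
    """Sum of cosine similarities over every (document in S_i, document in S_j) pair."""
    return sum(dot(a, b) for a in set_i for b in set_j)


if __name__ == "__main__":
    s_i = [normalize(v) for v in ([1, 2, 0], [0, 1, 1])]
    s_j = [normalize(v) for v in ([2, 0, 1], [1, 1, 1], [0, 3, 4])]
    d_i, d_j = composite(s_i), composite(s_j)
    # Pair sums computed the slow way and via the composite vectors.
    print(pairwise_similarity_sum(s_i, s_j), dot(d_i, d_j))
    print(pairwise_similarity_sum(s_i, s_i), dot(d_i, d_i))
```

The second identity counts every ordered pair within S_i, including each document paired with itself, which is why it equals the squared length of the composite vector.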


Criterion Functions (1/5): Internal Criterion Functions

I1
  maximizes the sum of the average pairwise similarities between the documents assigned to each cluster; uses the cosine function
  similar to the group-average heuristic that hierarchical agglomerative clustering uses to decide which clusters to merge

I2
  the criterion used by the vector-space variant of the K-means algorithm; uses the cosine function
  C_r: the centroid vector of cluster r
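
As a sketch of how the two internal criteria can be evaluated, the code below assumes the forms given by Zhao and Karypis: I1 weights each cluster's average pairwise similarity by the cluster size n_r, which reduces to the sum of ‖D_r‖² / n_r, and I2 sums the cosine between each document and its cluster centroid, which reduces to the sum of ‖D_r‖. The function names are hypothetical, and documents are assumed to be unit-length.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composite(docs):
    """Component-wise sum of the document vectors in one cluster."""
    return [sum(column) for column in zip(*docs)]


def i1_from_pairs(clusters):
    """I1 from its definition: size-weighted average pairwise similarity per cluster."""
    total = 0.0
    for docs in clusters:
        n_r = len(docs)
        pair_sum = sum(dot(a, b) for a in docs for b in docs)
        total += n_r * pair_sum / (n_r * n_r)
    return total


def i1(clusters):
    """I1 in closed form: sum over clusters of ||D_r||^2 / n_r."""
    total = 0.0
    for docs in clusters:
        d_r = composite(docs)
        total += dot(d_r, d_r) / len(docs)
    return total


def i2(clusters):
    """I2 in closed form: sum over clusters of ||D_r||."""
    total = 0.0
    for docs in clusters:
        d_r = composite(docs)
        total += math.sqrt(dot(d_r, d_r))
    return total


if __name__ == "__main__":
    clusters = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]]
    print(i1(clusters), i1_from_pairs(clusters), i2(clusters))
```

Both criteria are maximized; a clustering whose clusters hold mutually similar documents produces longer composite vectors and therefore larger values.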


Criterion Functions (2/5): External Criterion Functions: E1, E2

optimize a function based on how the clusters differ from each other
the external criterion is derived by making the centroid vectors of the different clusters as orthogonal as possible

C: the centroid vector of the entire collection
D: the composite vector of the entire collection; 1/||D|| is constant and can be dropped
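
A minimal sketch of E1, assuming the form given by Zhao and Karypis: minimize the sum over clusters of n_r · cos(C_r, C). Because 1/‖D‖ is constant, this is equivalent to minimizing the sum of n_r · D_r^t D / ‖D_r‖. The function names are hypothetical, and no formula for E2 is attempted here.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def composite(docs):
    """Component-wise sum of the document vectors."""
    return [sum(column) for column in zip(*docs)]


def e1(clusters):
    """E1 with the constant 1/||D|| dropped: sum of n_r * D_r^t D / ||D_r||."""
    all_docs = [doc for docs in clusters for doc in docs]
    d_all = composite(all_docs)
    total = 0.0
    for docs in clusters:
        d_r = composite(docs)
        total += len(docs) * dot(d_r, d_all) / norm(d_r)
    return total


def e1_from_cosines(clusters):
    """E1 from its definition: sum of n_r * cos(C_r, C)."""
    all_docs = [doc for docs in clusters for doc in docs]
    c_all = [x / len(all_docs) for x in composite(all_docs)]
    total = 0.0
    for docs in clusters:
        c_r = [x / len(docs) for x in composite(docs)]
        total += len(docs) * dot(c_r, c_all) / (norm(c_r) * norm(c_all))
    return total


if __name__ == "__main__":
    clusters = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0]]]
    print(e1(clusters), e1_from_cosines(clusters))
```

The two functions differ only by the constant factor ‖D‖, so they rank clustering solutions identically.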


Criterion Functions (3/5)

The external criterion can also be defined with the Euclidean distance function.

Hybrid Criterion Functions: H1, H2
  maximize the similarity of the documents within each cluster while minimizing the similarity between each cluster's documents and the entire collection
  H1: combines criterion functions I1 and E1


Criterion Functions (4/5)

  H2: combines criterion functions I2 and E1
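
The hybrid criteria can be sketched as ratios, assuming the forms given by Zhao and Karypis: H1 = I1 / E1 and H2 = I2 / E1, both maximized. The closed forms used for I1, I2, and E1 (sums of ‖D_r‖² / n_r, ‖D_r‖, and n_r · D_r^t D / ‖D_r‖) are the same assumption, and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composite(docs):
    """Component-wise sum of the document vectors."""
    return [sum(column) for column in zip(*docs)]


def criteria(clusters):
    """Return (I1, I2, E1) for a clustering of unit-length document vectors."""
    d_all = composite([doc for docs in clusters for doc in docs])
    i1 = i2 = e1 = 0.0
    for docs in clusters:
        n_r = len(docs)
        d_r = composite(docs)
        length = math.sqrt(dot(d_r, d_r))
        i1 += length * length / n_r          # intra-cluster, pairwise form
        i2 += length                          # intra-cluster, centroid form
        e1 += n_r * dot(d_r, d_all) / length  # similarity to the whole collection
    return i1, i2, e1


def h1(clusters):
    """Hybrid criterion H1 = I1 / E1 (larger is better)."""
    i1, _, e1 = criteria(clusters)
    return i1 / e1


def h2(clusters):
    """Hybrid criterion H2 = I2 / E1 (larger is better)."""
    _, i2, e1 = criteria(clusters)
    return i2 / e1


if __name__ == "__main__":
    a, b, c, d = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]
    good = [[a, b], [c, d]]  # identical documents grouped together
    bad = [[a, c], [b, d]]   # each cluster mixes unrelated documents
    print(h1(good), h1(bad), h2(good), h2(bad))
```

On this toy input the clustering that groups identical documents scores higher under both H1 and H2 than the one that mixes them, as intended for a criterion that rewards tight, well-separated clusters.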

Graph-Based Criterion Functions
  an alternative way to view the relations between the documents is to use graphs
  G1: based on the pairwise similarities between the documents
  G2: based on the similarities between the documents and the terms
  S: the given collection of n documents; G_s: the similarity graph


Criterion Functions (5/5)

G1.

G2.
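
For G1, a hedged sketch under the form given by Zhao and Karypis: minimize, over clusters, the edge cut between a cluster and the rest of the collection divided by the cluster's internal similarity. With cosine similarity on unit-length documents this becomes the sum of D_r^t (D − D_r) / ‖D_r‖². The function names are hypothetical, and G2 is not covered.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composite(docs):
    """Component-wise sum of the document vectors."""
    return [sum(column) for column in zip(*docs)]


def g1(clusters):
    """G1 in closed form: sum of D_r^t (D - D_r) / ||D_r||^2 (smaller is better)."""
    d_all = composite([doc for docs in clusters for doc in docs])
    total = 0.0
    for docs in clusters:
        d_r = composite(docs)
        rest = [x - y for x, y in zip(d_all, d_r)]
        total += dot(d_r, rest) / dot(d_r, d_r)
    return total


def g1_from_graph(clusters):
    """G1 from the similarity graph: cut(S_r, S - S_r) over internal similarity."""
    total = 0.0
    for r, docs in enumerate(clusters):
        outside = [doc for q, other in enumerate(clusters) if q != r for doc in other]
        cut = sum(dot(a, b) for a in docs for b in outside)
        internal = sum(dot(a, b) for a in docs for b in docs)
        total += cut / internal
    return total


if __name__ == "__main__":
    a, b, c, d = [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]
    print(g1([[a, b], [c, d]]), g1([[a, c], [b, d]]))
```

A clustering with no similarity crossing cluster boundaries has a cut of zero for every cluster, so G1 reaches its minimum of 0.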


Experimental Results: Direct k-way Clustering


Experimental Results


Data Sets

the 'Natural Science' category in the Naver directory (http://dir.naver.com)
6 subcategories in the corpus
1,215 docs, 17,223 terms, 20 clusters, 5 features per doc, idf weighting

Subcategory      No. of Docs.
Physics          102
Biology          426
Mathematics      102
Earth science    149
Astronomy        323
Chemistry        113
Total            1,215


Experimental parameters: Algorithms

rb, rbr: k-1 repeated bisections (rbr additionally optimizes the criterion function over the whole solution at the end)
direct: computed by simultaneously finding all k clusters
agglo: the agglomerative paradigm
graph: partitions a nearest-neighbor graph


Experimental parameters

Criterion functions: i1, i2, e1, g1, g1p, h1, h2, clink, slink
Similarity function: cosine measure


Experimental results: Entropy

Criterion       rb     rbr    direct  agglo  graph
I1              .464   .452   .490    .642
I2              .379   .375   .374    .564
E1              .388   .398   .416    .540
G1              .389   .418   .398    .895
G1p             .326   .366   .391    .562
H1              .386   .392   .386    .541
H2              .348   .352   .367    .559
Clink                                 .761
slink                                 .895
Cut functions                                .417


[Figure: entropy for each clustering method (rb, rbr, direct, agglo, graph), one series per criterion function (I1, I2, E1, G1, G1p, H1, H2, Clink, slink); vertical axis from 0 to 0.9]


Experimental results: Purity

Criterion       rb     rbr    direct  agglo  graph
I1              .686   .690   .683    .548
I2              .772   .762   .761    .629
E1              .741   .737   .723    .647
G1              .768   .739   .752    .367
G1p             .780   .758   .758    .647
H1              .753   .744   .758    .634
H2              .780   .782   .751    .650
Clink                                 .458
slink                                 .368
Cut functions                                .749


[Figure: purity for each clustering method (rb, rbr, direct, agglo, graph), one series per criterion function (I1, I2, E1, G1, G1p, H1, H2, Clink, slink); vertical axis from 0 to 0.8]
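
Entropy and purity are both computed from a cluster-by-class contingency table. The sketch below assumes the usual definitions from Zhao and Karypis: the entropy of a cluster is normalized by log q for q classes, purity is the share of the cluster's largest class, and both are averaged with weights n_r / n. The function name and the sample table are hypothetical and are not the experiment's data.

```python
import math


def entropy_and_purity(confusion):
    """Return (entropy, purity) for a clustering.

    confusion[r][i] is the number of documents of class i assigned to cluster r.
    Lower entropy and higher purity indicate a better clustering.
    """
    n_classes = len(confusion[0])
    n_total = sum(sum(row) for row in confusion)
    entropy = 0.0
    purity = 0.0
    for row in confusion:
        n_r = sum(row)
        if n_r == 0:
            continue  # an empty cluster contributes nothing
        cluster_entropy = 0.0
        for count in row:
            if count > 0:
                p = count / n_r
                cluster_entropy -= p * math.log(p)
        if n_classes > 1:
            cluster_entropy /= math.log(n_classes)  # normalize to the range [0, 1]
        entropy += (n_r / n_total) * cluster_entropy
        purity += (n_r / n_total) * (max(row) / n_r)
    return entropy, purity


if __name__ == "__main__":
    # Two clusters, two classes: the first cluster is mixed, the second is pure.
    print(entropy_and_purity([[3, 1], [0, 4]]))
```

A perfect clustering gives an entropy of 0 and a purity of 1, which is the direction in which the tables above should be read.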


Best results

Method   Best criterion   Entropy   Purity
rb       g1p              0.326     0.780
rbr      h2               0.352     0.782
direct   h1               0.386     0.758
agglo    h1               0.541     0.634
graph    cut              0.417     0.749
