Clustering for web documents 1
Clustering for web documents
박흠
Clustering for web documents 2
Contents
Cluto
Criterion Functions for Document Clustering: Experiments and Analysis (2002)
by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
Feature selection for web documents (2004)
Clustering for web documents 3
Cluto
Clustering Toolkit, version 2.1.1
Department of Computer Science, University of Minnesota, Minneapolis
http://www-users.cs.umn.edu/~karypis/
Platforms: Linux 2.4.18, Sun OS 5.7, Win32
Programs: CLUTO's user-callable library, vcluster, scluster
Clustering for web documents 4
Cluto
What is Cluto (1/2)
Clustering algorithms: partitional clustering, agglomerative clustering, graph-partitioning clustering
Clustering criterion functions: provides seven different criterion functions that can be used with both the partitional and the agglomerative clustering algorithms
Also provides some of the more traditional local criteria (e.g., single-link, complete-link, and UPGMA) for agglomerative clustering.
Clustering for web documents 5
Cluto
What is Cluto (2/2)
Analyzes the discovered clusters:
relations between the objects assigned to each cluster
relations between the different clusters
identifies the features that best describe and/or discriminate each cluster
relationships between the clusters, objects, and features
Operates on very large datasets, in terms of both the number of objects and the number of dimensions.
Clustering for web documents 6
Cluto
Programs
vcluster: operates in the objects' feature space
scluster: operates in the objects' similarity space
Interface: vcluster [optional parameters] MatrixFile Ncluster
MatrixFile: an n*m matrix whose rows correspond to objects and whose columns correspond to features
Ncluster: number of clusters
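As a rough illustration of the MatrixFile input, the sketch below serializes a sparse document-term matrix as text. The layout used here (a header line with the number of rows, columns, and non-zero entries, then one line per object holding 1-based column/value pairs) is an assumption about the sparse format and should be checked against the CLUTO manual; the function name `sparse_matrix_text` is hypothetical.

```python
def sparse_matrix_text(rows, n_cols):
    """Serialize a sparse n*m matrix, one object (row) per line.

    rows: list of dicts mapping a 0-based column index to a non-zero value.
    Assumed layout (verify against the CLUTO manual): a header line
    "n m nnz", then for each row the pairs "column value" with columns
    numbered from 1.
    """
    nnz = sum(len(row) for row in rows)
    lines = [f"{len(rows)} {n_cols} {nnz}"]
    for row in rows:
        pairs = (f"{col + 1} {val}" for col, val in sorted(row.items()))
        lines.append(" ".join(pairs))
    return "\n".join(lines) + "\n"


# Two objects (documents) described by three features (terms).
example = sparse_matrix_text([{0: 1.0, 2: 2.5}, {1: 3.0}], n_cols=3)
```

The resulting text could be written to a file and passed as MatrixFile, with Ncluster given as the last argument.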
Clustering for web documents 7
Cluto
Parameters of Algorithms
rb, rbr: k-1 repeated bisections (rbr: additionally optimizes the criterion function)
direct: computed by simultaneously finding all k clusters
agglo: the agglomerative paradigm
graph: using a nearest-neighbor graph
bagglo
Clustering for web documents 8
Cluto
Parameters of the similarity function
cos: the cosine function (default)
corr: the correlation coefficient
dist: the Euclidean distance; applicable only when -clmethod=graph
jacc: the extended Jaccard coefficient; applicable only when -clmethod=graph
Clustering for web documents 9
Cluto
Parameters of the criterion function
i1, i2, e1, g1, g1p, h1, h2
Clustering for web documents 10
Cluto
Parameters of the criterion function
slink: single link
wslink: weighted single link
clink: complete link
wclink: weighted complete link
upgma: UPGMA
cstype
fulltree
rowmodel, colmodel
showfeatures
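The parameters on the preceding slides combine into a single vcluster call. The sketch below only builds the argument list and does not run anything. The option names `-sim` and `-crfun` are assumptions (the slides show only `-clmethod=graph` explicitly), and `docs.mat` is a hypothetical file name.

```python
def vcluster_command(matrix_file, n_clusters, clmethod, sim, crfun):
    """Build an argument list of the form
    vcluster [optional parameters] MatrixFile Ncluster.

    The -sim and -crfun option names are assumed, not taken from the slides.
    """
    return [
        "vcluster",
        f"-clmethod={clmethod}",  # rb, rbr, direct, agglo, graph, bagglo
        f"-sim={sim}",            # cos, corr, dist, jacc
        f"-crfun={crfun}",        # i1, i2, e1, g1, g1p, h1, h2, slink, ...
        matrix_file,
        str(n_clusters),
    ]


# Repeated bisections with refinement, cosine similarity, criterion h2, 20 clusters.
cmd = vcluster_command("docs.mat", 20, clmethod="rbr", sim="cos", crfun="h2")
```

The list could be handed to `subprocess.run` on a machine where vcluster is installed.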
Clustering for web documents 11
Clustering for web documents 12
Criterion Functions for Document Clustering: Experiments and Analysis (2002)
by Ying Zhao and George Karypis, Department of Computer Science, University of Minnesota, Minneapolis, MN 55455
Clustering for web documents 13
Data Clustering
A.K. JAIN, Michigan State University
M.N. MURTY, Indian Institute of Science
and P.J. FLYNN, The Ohio State University
ACM Computing Surveys
Clustering for web documents 14
Introduction (1/2)
Clustering algorithms
Agglomerative algorithms: UPGMA, single-link, complete-link, CURE, ROCK, Chameleon
Partitional algorithms: K-means, K-medoids, Autoclass, graph-partitioning-based, spectral-partitioning-based; well suited to large datasets because they are fast
Seven criterion functions: each measures intra-cluster similarity, inter-cluster similarity, or a combination of the two: i1, i2, e1, g1, g1p, h1, h2
Clustering for web documents 15
Introduction (2/2)
Datasets
15 different data sets
Clustering for web documents 16
Preliminaries (1/3)
Document Representation
Each document is represented with the vector space model.
d: document, tf: term frequency, tf_i: frequency of the i-th term in the document
Terms are weighted with idf or tf*idf; N: total number of documents
Similarity Measures
The similarity between two documents d_i and d_j
Cosine function: $\cos(d_i, d_j) = d_i^t d_j / (\|d_i\| \, \|d_j\|)$, where $\|d\|$ is the length of the document vector, used for normalization
1: identical, 0: nothing in common
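A minimal sketch of the representation and the cosine measure described on this slide. The slide's weighting formula did not survive extraction, so the common form tf_i * log(N / df_i) is assumed here, where df_i (the number of documents containing term i) is a symbol not defined in the slides. The function names and the toy documents are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one unit-length tf*idf vector (dict term -> weight) per tokenized document."""
    n_docs = len(docs)
    # df[t]: number of documents that contain term t (assumed idf ingredient).
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vec = {t: f * math.log(n_docs / df[t]) for t, f in tf.items()}
        length = math.sqrt(sum(w * w for w in vec.values()))
        if length > 0:
            vec = {t: w / length for t, w in vec.items()}
        vectors.append(vec)
    return vectors


def cosine(a, b):
    """Cosine similarity of two sparse vectors: 1 = identical direction, 0 = nothing in common."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    len_a = math.sqrt(sum(w * w for w in a.values()))
    len_b = math.sqrt(sum(w * w for w in b.values()))
    if len_a == 0 or len_b == 0:
        return 0.0
    return dot / (len_a * len_b)


docs = [["web", "cluster", "cluster"], ["web", "cluster"], ["star", "planet"]]
vecs = tfidf_vectors(docs)
```

Here the first two documents share all their terms and score close to 1, while the third shares no term with the others and scores 0.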
Clustering for web documents 17
Preliminaries (2/3)
Euclidean function
If dis = 0, the documents are identical; if dis = $\sqrt{2}$ (for unit-length document vectors), they have nothing in common.
Definitions
S: set of documents
S1, S2, …, Sk: the sets of documents in each of the k clusters
k: number of clusters
n1, n2, …, nk: sizes (numbers of documents) of the corresponding clusters
A: a set of documents
Composite vector D_A: the sum of all document vectors in A
Centroid vector C_A: the average of the term weights of the documents in A
Clustering for web documents 18
Preliminaries (3/3)
Vector Properties
S_i, S_j: two sets of documents containing n_i and n_j documents
D_i, D_j: the composite vectors; C_i, C_j: the centroid vectors
The sum of the pairwise similarities between the documents in S_i and S_j is $D_i^t D_j$
The sum of the pairwise similarities between the documents in S_i is $\|D_i\|^2$
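The two vector properties can be checked numerically. The sketch below assumes unit-length document vectors, so that a dot product equals the cosine similarity; the toy vectors and helper names are made up for illustration. The within-set sum counts every ordered pair, including each document with itself.

```python
import math


def unit(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def composite(vectors):
    """Composite vector: component-wise sum of all document vectors."""
    return [sum(col) for col in zip(*vectors)]


S_i = [unit(v) for v in ([1, 2, 0], [0, 1, 1])]
S_j = [unit(v) for v in ([3, 0, 1], [1, 1, 1], [0, 2, 5])]
D_i, D_j = composite(S_i), composite(S_j)

# Explicit double loops over document pairs.
pair_sum_ij = sum(dot(a, b) for a in S_i for b in S_j)
pair_sum_ii = sum(dot(a, b) for a in S_i for b in S_i)
```

Both sums match the closed forms, which is why the criterion functions can be evaluated from composite vectors without enumerating document pairs.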
Clustering for web documents 19
Criterion Functions (1/5)
Internal Criterion Functions
I1: maximizes the sum of the average pairwise similarities between the documents assigned to each cluster; uses the cosine function.
It is similar to the criterion used by hierarchical agglomerative clustering with the group-average heuristic to decide which clusters to merge.
I2: uses the cosine function; this is the criterion of the vector-space variant of the K-means algorithm.
C_r: centroid vector of cluster r
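The formulas for I1 and I2 were images on the slide and are missing from this transcript. The sketch below uses the forms given in the Zhao and Karypis paper, I1 = sum_r ||D_r||^2 / n_r and I2 = sum_r ||D_r||, which should be read as an assumption about what the slide showed. Documents are assumed to be unit-length vectors; the toy data is illustrative.

```python
import math


def composite(vectors):
    """Composite vector D_r: sum of the document vectors of a cluster."""
    return [sum(col) for col in zip(*vectors)]


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def i1(clusters):
    """I1 (assumed form): sum over clusters of ||D_r||^2 / n_r. To be maximized."""
    return sum(norm(composite(c)) ** 2 / len(c) for c in clusters)


def i2(clusters):
    """I2 (assumed form): sum over clusters of ||D_r||. To be maximized."""
    return sum(norm(composite(c)) for c in clusters)


a, b = [1.0, 0.0], [0.0, 1.0]
good = [[a, a], [b]]   # identical documents share a cluster
bad = [[a, b], [a]]    # orthogonal documents share a cluster
```

Both functions score the `good` partition higher than the `bad` one, as expected of criteria that reward intra-cluster similarity.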
Clustering for web documents 20
Criterion Functions (2/5)
External Criterion Functions: E1, E2
Optimize a function based on how the clusters differ from each other.
The external function is derived by requiring the centroid vectors of the different clusters to be as orthogonal as possible.
C: the centroid vector of the entire collection
D: the composite vector of the entire collection; 1/||D|| is constant, so it can be dropped when optimizing.
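The E1 formula itself is not in the transcript. A minimal sketch, assuming the form from the Zhao and Karypis paper with the constant 1/||D|| dropped: E1 = sum_r n_r * D_r^t D / ||D_r||, to be minimized. Unit-length documents and non-empty clusters are assumed; the toy data is illustrative.

```python
import math


def composite(vectors):
    return [sum(col) for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    return math.sqrt(dot(v, v))


def e1(clusters):
    """E1 (assumed form): sum_r n_r * (D_r . D) / ||D_r||. Smaller is better."""
    D = composite([d for c in clusters for d in c])  # composite of the whole collection
    total = 0.0
    for c in clusters:
        D_r = composite(c)
        total += len(c) * dot(D_r, D) / norm(D_r)
    return total


a, b = [1.0, 0.0], [0.0, 1.0]
good = [[a, a], [b]]
bad = [[a, b], [a]]
```

Because E1 measures how closely each cluster aligns with the collection as a whole, the better partition has the lower value.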
Clustering for web documents 21
Criterion Functions (3/5)
Defined with the Euclidean distance function.
Hybrid Criterion Functions: H1, H2
Maximize the similarity of the documents within each cluster while minimizing the similarity between each cluster's documents and the entire collection.
H1: combines criterion functions I1 and E1
Clustering for web documents 22
Criterion Functions (4/5)
H2: combines criterion functions I2 and E1
Graph-Based Criterion Functions
One way to view the relations between documents is to use graphs.
G1: computes pairwise similarities between the documents
G2: computes pairwise similarities between the documents and the terms
S: the given collection of n documents
G_s: similarity graph
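The slides say only that H1 and H2 "combine" an internal function with E1. The sketch below assumes the ratio form used in the Zhao and Karypis paper, H1 = I1 / E1 and H2 = I2 / E1 (both maximized), together with the assumed forms I1 = sum_r ||D_r||^2 / n_r, I2 = sum_r ||D_r||, and E1 = sum_r n_r * D_r^t D / ||D_r||. Unit-length documents are assumed and the toy data is illustrative.

```python
import math


def composite(vectors):
    return [sum(col) for col in zip(*vectors)]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(v):
    return math.sqrt(dot(v, v))


def i1(clusters):
    return sum(norm(composite(c)) ** 2 / len(c) for c in clusters)


def i2(clusters):
    return sum(norm(composite(c)) for c in clusters)


def e1(clusters):
    D = composite([d for c in clusters for d in c])
    return sum(len(c) * dot(composite(c), D) / norm(composite(c)) for c in clusters)


def h1(clusters):
    """H1 (assumed ratio form): I1 / E1. To be maximized."""
    return i1(clusters) / e1(clusters)


def h2(clusters):
    """H2 (assumed ratio form): I2 / E1. To be maximized."""
    return i2(clusters) / e1(clusters)


a, b = [1.0, 0.0], [0.0, 1.0]
good = [[a, a], [b]]
bad = [[a, b], [a]]
```

Dividing by E1 rewards partitions that are both internally cohesive and well separated from the collection as a whole.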
Clustering for web documents 23
Criterion Functions (5/5)
G1
G2
Clustering for web documents 24
Clustering for web documents 25
Clustering for web documents 26
Experimental Results
Direct k-way Clustering
Clustering for web documents 27
Experimental Results
Clustering for web documents 28
Experimental Results
Clustering for web documents 29
Data Sets
The "Natural Science" category in the Naver directory (http://dir.naver.com)
6 subcategories in the corpus
1,215 documents, 17,223 terms, 20 clusters, 5 features per document, idf weighting

Subcategory      No. of docs
Physics          102
Earth science    149
Biology          426
Astronomy        323
Mathematics      102
Chemistry        113
Total            1,215
Clustering for web documents 30
Experimental parameters
Algorithms
rb, rbr: k-1 repeated bisections (rbr: additionally optimizes the criterion function)
direct: computed by simultaneously finding all k clusters
agglo: the agglomerative paradigm
graph: using a nearest-neighbor graph
Clustering for web documents 31
Experimental parameters
Criterion functions: i1, i2, e1, g1, g1p, h1, h2, clink, slink
Similarity function: cosine measure
Clustering for web documents 32
Experimental results
Entropy

                rb     rbr    direct  agglo   graph
I1              .464   .452   .490    .642
I2              .379   .375   .374    .564
E1              .388   .398   .416    .540
G1              .389   .418   .398    .895
G1p             .326   .366   .391    .562
H1              .386   .392   .386    .541
H2              .348   .352   .367    .559
clink                                 .761
slink                                 .895
Cut functions                                 .417
Clustering for web documents 33
Entropy
[Chart: entropy (y-axis, 0 to 0.9) for each clustering method (rb, rbr, direct, agglo, graph); series: I1, I2, E1, G1, G1p, H1, H2, clink, slink]
Clustering for web documents 34
Experimental results
Purity

                rb     rbr    direct  agglo   graph
I1              .686   .690   .683    .548
I2              .772   .762   .761    .629
E1              .741   .737   .723    .647
G1              .768   .739   .752    .367
G1p             .780   .758   .758    .647
H1              .753   .744   .758    .634
H2              .780   .782   .751    .650
clink                                 .458
slink                                 .368
Cut functions                                 .749
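The slides report entropy and purity without defining them. A minimal sketch, assuming the usual definitions from the Zhao and Karypis paper: the entropy of a cluster is computed over the class distribution of its documents and normalized by log q (q = number of classes), its purity is the fraction of documents belonging to its largest class, and both are averaged over clusters weighted by cluster size. The function name and the toy labels are illustrative.

```python
import math
from collections import Counter


def entropy_and_purity(cluster_ids, class_labels):
    """Return (entropy, purity) of a clustering against known class labels.

    Lower entropy and higher purity indicate a better clustering.
    """
    n = len(class_labels)
    q = len(set(class_labels))
    members = {}
    for cid, label in zip(cluster_ids, class_labels):
        members.setdefault(cid, []).append(label)

    entropy = 0.0
    purity = 0.0
    for labels in members.values():
        n_r = len(labels)
        counts = Counter(labels)
        e_r = -sum((m / n_r) * math.log(m / n_r) for m in counts.values())
        if q > 1:
            e_r /= math.log(q)  # normalize so that entropy lies in [0, 1]
        entropy += (n_r / n) * e_r
        purity += max(counts.values()) / n  # equals (n_r / n) * (largest class / n_r)
    return entropy, purity


perfect = entropy_and_purity([0, 0, 1, 1], ["physics", "physics", "biology", "biology"])
mixed = entropy_and_purity([0, 0, 0, 0], ["physics", "physics", "biology", "biology"])
```

A perfect clustering gives entropy 0 and purity 1, while a single cluster that mixes two equally sized classes gives entropy 1 and purity 0.5.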
Clustering for web documents 35
Purity
[Chart: purity (y-axis, 0 to 0.8) for each clustering method (rb, rbr, direct, agglo, graph); series: I1, I2, E1, G1, G1p, H1, H2, clink, slink]
Clustering for web documents 36
Best results

             rb      rbr     direct  agglo   graph
Criterion    g1p     h2      h1      h1      cut
Entropy      0.326   0.352   0.386   0.541   0.417
Purity       0.780   0.782   0.758   0.634   0.749