SCATTER/GATHER : A CLUSTER BASED APPROACH FOR BROWSING LARGE DOCUMENT COLLECTIONS
GROUPER : A DYNAMIC CLUSTERING INTERFACE TO WEB SEARCH RESULTS
MINAL PATANKAR MADHURI WUDALI
DOCUMENT CLUSTERING
Process of grouping documents with
similar contents into a common cluster
ADVANTAGES OF DOCUMENT CLUSTERING
• If a collection is well clustered, we can search only the cluster that will contain relevant documents
• Clustering also improves browsing through the document collection
                        SCATTER/GATHER               GROUPER
Input                   Document collection          Meta search engine
Clustering algorithm    Traditional text-based       STC
                        (Buckshot, Fractionation)
Similarity              Word based                   Phrase based
Purpose                 A tool for browsing          A tool for searching
USER INTERFACES
SCATTER /GATHER INTERFACE
SCATTER/GATHER SESSION
• User is presented with short summaries of a small number of document groups.
• User selects one or more groups for further study.
• This process continues down to the individual document level.
[Figure: Scatter/Gather session diagram. Labels: Fractionation, Buckshot, Cluster Digest]
HOW IS SCATTER/GATHER DONE?
• Static offline partitioning phase: Fractionation algorithm
• Online reclustering phase: Buckshot algorithm
  – Step 1: Group average agglomerative clustering
  – Step 2: K-Means
Clustering
• Partitional
  – K-Means
• Hybrid
  – Buckshot
  – Fractionation
• Hierarchical
  – Agglomerative: Single link, Complete link, Group average link
  – Divisive
HIERARCHICAL AGGLOMERATIVE CLUSTERING
• Create an N×N doc-doc similarity matrix
• Each document starts as a cluster of size one
• Do until there is only one cluster:
  – combine the two clusters with the greatest similarity
  – update the doc-doc matrix
Example
      A     B     C     D     E
A     _     2     7     6     4
B     2     _     9     11    14
C     7     9     _     4     8
D     6     11    4     _     2
E     4     14    8     2     _

Clusters: A B C D E, then A BE C D (B and E have the greatest similarity, 14)
SC(A,BE) = 4 if we are using single link (take max)
SC(A,BE) = 2 if we are using complete linkage (take min)
SC(A,BE) = 3 if we are using group average (take average)
Note: with group average, C - BE is now the highest link
Example
      A     BE    C     D
A     _     3     7     6
BE    3     _     8.5   6.5
C     7     8.5   _     4
D     6     6.5   4     _

COMBINING
SC(C,B) = 9
SC(C,E) = 8
SC(C,BE) = 8.5
Clusters: BE A C D, then BEC A D
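Example
      A     BEC   D
A     _     5     6
BEC   5     _     5.25
D     6     5.25  _

SC(A,BEC) = (3 + 7) / 2 = 5
SC(D,BEC) = (6.5 + 4) / 2 = 5.25

COMBINING
Clusters: BEC A D, then BEC AD (A and D now have the greatest similarity, 6)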
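The worked example above can be replayed with a short sketch. It assumes the update rule used in the slides, where the merged cluster's similarity to a third cluster is the max, min, or plain average of the two merged rows; a size-weighted group average would give slightly different numbers. The names `link`, `hac`, and `example` are illustrative, not from the original.

```python
from itertools import combinations

def link(sa, sb, linkage):
    """Similarity of a third cluster to the merge of two clusters."""
    if linkage == "single":
        return max(sa, sb)      # single link: take max
    if linkage == "complete":
        return min(sa, sb)      # complete link: take min
    return (sa + sb) / 2        # average of the two merged rows (assumed rule)

def hac(labels, sim, linkage="average"):
    """Return the merges in order as (cluster_a, cluster_b, similarity)."""
    sim = {frozenset(pair): s for pair, s in sim.items()}
    clusters = list(labels)     # each document starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Combine the two clusters with the greatest similarity.
        a, b = max(combinations(clusters, 2), key=lambda p: sim[frozenset(p)])
        merged = a + b          # cluster name is the concatenated labels
        rest = [c for c in clusters if c not in (a, b)]
        # Update the matrix: add one row for the merged cluster.
        for c in rest:
            sim[frozenset((merged, c))] = link(
                sim[frozenset((a, c))], sim[frozenset((b, c))], linkage)
        merges.append((a, b, sim[frozenset((a, b))]))
        clusters = rest + [merged]
    return merges

# Similarity matrix from the example (upper triangle only).
example = {
    ("A", "B"): 2, ("A", "C"): 7, ("A", "D"): 6, ("A", "E"): 4,
    ("B", "C"): 9, ("B", "D"): 11, ("B", "E"): 14,
    ("C", "D"): 4, ("C", "E"): 8, ("D", "E"): 2,
}
merges = hac("ABCDE", example)
# First three merges: B with E (14), C with BE (8.5), A with D (6)
```

Passing `linkage="single"` or `linkage="complete"` changes only the row update, which is enough to change the merge order after the first step.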
SCATTER/GATHER SESSION STAGE 1
FRACTIONATION
• Corpus C is broken into N/m buckets of fixed size m > k
• Apply group average agglomerative clustering on each bucket
• Generate document groups, given as input to the next iteration
• Repeat till 'k' centers remain
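The Fractionation steps above might be sketched as follows. This is a simplified illustration, not the original implementation: documents are assumed to be numeric vectors (tuples) compared by Euclidean distance, and the reduction factor `rho` (the fraction of groups kept per bucket in each pass) is an assumption, since the slides do not state one. All function names are hypothetical.

```python
import math
from itertools import combinations

def group_average(a, b):
    """Average pairwise distance between the members of two groups."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def agglomerate(groups, target):
    """Group average agglomerative clustering until `target` groups remain."""
    groups = [list(g) for g in groups]
    while len(groups) > target:
        i, j = min(combinations(range(len(groups)), 2),
                   key=lambda ij: group_average(groups[ij[0]], groups[ij[1]]))
        merged = groups[i] + groups[j]
        groups = [g for idx, g in enumerate(groups) if idx not in (i, j)]
        groups.append(merged)
    return groups

def fractionation(points, k, m, rho=0.5):
    """Reduce the corpus to k groups by clustering fixed-size buckets."""
    per_bucket = math.ceil(rho * m)
    if not (k <= per_bucket < m):
        raise ValueError("need k <= ceil(rho * m) < m")
    groups = [[p] for p in points]
    while len(groups) > m:
        next_groups = []
        # Break the current groups into buckets of size m.
        for start in range(0, len(groups), m):
            bucket = groups[start:start + m]
            target = max(1, math.ceil(rho * len(bucket)))
            # The groups produced here are the input to the next iteration.
            next_groups.extend(agglomerate(bucket, target))
        groups = next_groups
    # Few enough groups remain to cluster them directly down to k.
    return agglomerate(groups, k)
```

The check on `rho` guarantees that every pass shrinks the number of groups without dropping below k.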
SCATTER/GATHER SESSION STAGE 2
BUCKSHOT
STEP 1: HAC
• Randomly take a sample of size sqrt(kn)
• Apply group average agglomerative clustering till we obtain 'k' clusters
• Return the obtained clusters
SCATTER/GATHER SESSION STAGE 2
BUCKSHOT
STEP 2: K-Means
• Arbitrarily select K documents as seeds; they are the initial centroids of each cluster (in Buckshot, the clusters from Step 1 supply these seeds)
• Assign all other documents to the closest centroid
• Compute the centroid of each cluster again to get the new centroid of each cluster
• Repeat steps 2 and 3 until the centroid of each cluster doesn't change
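The two Buckshot steps can be sketched together as below. This is a minimal illustration under stated assumptions: documents are numeric vectors (tuples), distance is Euclidean, and the loop stops when assignments no longer change, which is equivalent to the centroids no longer changing. The function names are hypothetical.

```python
import math
import random
from itertools import combinations

def centroid(points):
    return tuple(sum(coords) / len(points) for coords in zip(*points))

def group_average(a, b):
    """Average pairwise distance between the members of two clusters."""
    return sum(math.dist(p, q) for p in a for q in b) / (len(a) * len(b))

def hac_to_k(points, k):
    """Step 1: group average agglomerative clustering down to k clusters."""
    clusters = [[p] for p in points]
    while len(clusters) > k:
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: group_average(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters

def kmeans(points, seeds, max_iter=100):
    """Step 2: assign to the closest centroid, recompute, repeat until stable."""
    centroids = [tuple(s) for s in seeds]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        if new_assignment == assignment:
            break                       # nothing moved, so centroids are stable
        assignment = new_assignment
        for j in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:                 # keep the old centroid if a cluster empties
                centroids[j] = centroid(members)
    return assignment, centroids

def buckshot(points, k, rng=random):
    # Sample sqrt(k * n) documents, cluster the sample, then use the
    # resulting centroids as K-means seeds over the whole collection.
    size = min(len(points), max(k, math.isqrt(k * len(points))))
    sample = rng.sample(points, size)
    seeds = [centroid(c) for c in hac_to_k(sample, k)]
    return kmeans(points, seeds)
```

Clustering only the sample keeps the quadratic agglomerative step small, while the K-means pass touches every document once per iteration.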
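[Figure: Fractionation example. Documents A to H are split into Bucket 1 and Bucket 2; group average agglomerative clustering within each bucket yields groups such as BG, AH, DE and CF, which are merged further in later iterations (AH, BGCFDE).]
[Figure: Buckshot example. The documents in the sample (A, D, G, E) are clustered by group average agglomerative clustering into AG and DE; the remaining documents are assigned to these clusters using K-means.]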
GENESIS OF GROUPER
GROUPER
• A dynamic web interface to the Husky Search meta-search engine
• Clusters the top retrieved results of the Husky Search meta-search engine
• Dynamically groups search results into clusters
• Uses the STC algorithm for clustering
Grouper’s query interface.
Grouper Interface
STC (Suffix Tree Clustering)
• A fast, incremental algorithm
• Operates on web document snippets
• Relies on a suffix tree to identify common phrases
• Uses the common information to create clusters
WHAT IS A SUFFIX TREE?
• A suffix tree is a rooted, directed tree
• Each internal node has at least 2 children
• Each edge is labeled with a non-empty sub-string of S.
• The label of a node is the concatenation of the edge-labels on the path from the root to that node.
• No two edges out of the same node can have edge-labels that begin with the same word.
STEPS OF STC
Step 1: Document "cleaning"
Step 2: Identifying base clusters
Step 3: Combining base clusters
Step 4: Scoring clusters
DOCUMENT CLEANING
• Stemming
• Stripping of HTML, punctuation and numbers
Example: <html>2 Cats ate<b>cheese</b>.</html> becomes: Cat ate cheese
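A rough sketch of the cleaning step follows. The stemmer here is a deliberately naive placeholder (it only strips a trailing "s") standing in for a real stemming algorithm, and the output is lowercased, which the slides do not specify. Both function names are hypothetical.

```python
import re

def naive_stem(word):
    # Placeholder stemmer (assumption): strip a plural "s" only.
    # A real system would use a proper stemming algorithm here.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word

def clean(html):
    text = re.sub(r"<[^>]+>", " ", html)       # strip HTML tags
    text = re.sub(r"[^A-Za-z\s]", " ", text)   # strip punctuation and numbers
    return [naive_stem(w) for w in text.lower().split()]

words = clean("<html>2 Cats ate<b>cheese</b>.</html>")
# words == ["cat", "ate", "cheese"]
```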
Identifying Base Clusters
• Create an inverted index of strings from the web document collection using a suffix tree
• Each node of the suffix tree represents a group of documents and a string that is common to all of them
• The label of the node represents the common string
• Each node represents a base cluster
[Figure: suffix tree built from three documents: 1. cat ate cheese; 2. mouse ate cheese too; 3. cat ate mouse too. Each internal node is annotated with the documents that share its phrase.]
BASE CLUSTERS IDENTIFIED!!
Node   Phrase       Documents
a      cat ate      1,3
b      ate          1,2,3
c      cheese       1,2
d      mouse        2,3
e      too          2,3
f      ate cheese   1,2
Table 1: Six nodes and their corresponding base clusters
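The base clusters in Table 1 can be reproduced without building an actual suffix tree. The sketch below is a brute-force stand-in, not the linear-time suffix tree construction: it enumerates every word-level phrase and keeps those that would be internal nodes, meaning phrases that occur in at least two documents and are followed by at least two different continuations. The function name is hypothetical.

```python
from collections import defaultdict

def base_clusters(docs):
    """docs: {doc_id: list of cleaned words} -> {phrase: set of doc_ids}."""
    containing = defaultdict(set)   # phrase -> documents containing it
    followers = defaultdict(set)    # phrase -> distinct continuations
    for doc_id, words in docs.items():
        for start in range(len(words)):
            for end in range(start + 1, len(words) + 1):
                phrase = tuple(words[start:end])
                containing[phrase].add(doc_id)
                # A per-document end marker plays the role of the unique
                # terminator appended to each string in a suffix tree.
                nxt = words[end] if end < len(words) else ("$", doc_id)
                followers[phrase].add(nxt)
    return {
        " ".join(phrase): doc_ids
        for phrase, doc_ids in containing.items()
        if len(doc_ids) >= 2 and len(followers[phrase]) >= 2
    }

docs = {
    1: "cat ate cheese".split(),
    2: "mouse ate cheese too".split(),
    3: "cat ate mouse too".split(),
}
clusters = base_clusters(docs)
```

The phrase "cat" is shared by documents 1 and 3 but is always followed by "ate", so it is folded into the node "cat ate" instead of forming its own base cluster.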
SCORING BASE CLUSTERS
$S(B) = |B| \cdot f(|P|)$
• $|B|$ is the number of documents in base cluster B
• $|P|$ is the number of words in phrase P
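As a small illustration of the score, the sketch below assumes a particular shape for f, which the slides leave unspecified: single-word phrases are penalized, the value grows linearly with phrase length, and it is capped at six words. The constants are assumptions chosen for the example only.

```python
def f(phrase_len):
    # Assumed shape (not given in the slides): penalize single words,
    # grow linearly with length, stay constant beyond six words.
    if phrase_len <= 1:
        return 0.5
    return float(min(phrase_len, 6))

def score(doc_ids, phrase):
    """S(B) = |B| * f(|P|) for a base cluster with the given phrase."""
    return len(doc_ids) * f(len(phrase.split()))

score_ate = score({1, 2, 3}, "ate")            # 3 documents, 1 word
score_ate_cheese = score({1, 2}, "ate cheese") # 2 documents, 2 words
```

Under this assumed f, the two-word phrase "ate cheese" outscores the single word "ate" even though it covers fewer documents.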
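Combining Base Clusters
Binary similarity measure:
$$\mathrm{similarity}(B_m, B_n) = \begin{cases} 1 & \text{if } \dfrac{|B_m \cap B_n|}{|B_m|} > 0.5 \text{ and } \dfrac{|B_m \cap B_n|}{|B_n|} > 0.5 \\ 0 & \text{otherwise} \end{cases}$$
• $|B_m \cap B_n|$: documents which are in both clusters
• $|B_m|$: documents in cluster 'm'
• $|B_n|$: documents in cluster 'n'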
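COMBINING THE BASE CLUSTERS
Base cluster graph
[Figure: base cluster graph with one node per base cluster: cat ate (1,3), ate (1,2,3), cheese (1,2), mouse (2,3), too (2,3), ate cheese (1,2). Edges connect base clusters whose similarity is 1.]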
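The combining step can be sketched as a graph problem: connect two base clusters when the binary similarity is 1, then treat each connected component as a final cluster. Taking connected components as the final clusters follows the STC algorithm as published; the function names are hypothetical.

```python
def similar(bm, bn):
    """Binary similarity: overlap must exceed half of each cluster."""
    both = len(bm & bn)
    return both / len(bm) > 0.5 and both / len(bn) > 0.5

def merge_base_clusters(base):
    """base: {phrase: set of doc_ids} -> list of (phrases, doc_ids)."""
    names = list(base)
    seen, result = set(), []
    for name in names:
        if name in seen:
            continue
        seen.add(name)
        stack, component = [name], []
        # Depth-first walk over the base cluster graph.
        while stack:
            cur = stack.pop()
            component.append(cur)
            for other in names:
                if other not in seen and similar(base[cur], base[other]):
                    seen.add(other)
                    stack.append(other)
        doc_ids = set().union(*(base[n] for n in component))
        result.append((sorted(component), doc_ids))
    return result

table1 = {
    "cat ate": {1, 3}, "ate": {1, 2, 3}, "cheese": {1, 2},
    "mouse": {2, 3}, "too": {2, 3}, "ate cheese": {1, 2},
}
final = merge_base_clusters(table1)
```

For the six base clusters of Table 1, every node is linked to "ate", so the graph has a single connected component covering documents 1, 2 and 3.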
STC is Incremental
As each document arrives from the web:
• We "clean" it
• Add it to the suffix tree; each node that is updated or created as a result of this is tagged
• Update the relevant base clusters and recalculate the similarity of these base clusters to the rest of the k highest scoring base clusters
• Check for any changes to the final clusters
• Score and sort the final clusters, and choose the top 10
STC allows cluster overlap…
Why is overlap reasonable?
• A document often has more than one topic
• STC allows a document to appear in more than one cluster, since documents may share more than one phrase with other documents