1
Similarity of Documents and Document Collections using
attributes with low noise
Chris Biemann, Uwe Quasthoff
Ifi, NLP Department
University of Leipzig, Germany
Monday, March 5, 2007
WEBIST'07, Barcelona
2
Outline
• Motivation
• Attributes with Low Noise
  - Low frequency terms
  - Link similarity
• Chinese Whispers Graph Clustering
• Experimental Results
  - Low frequency terms
  - Link similarity
• Conclusion
3
Motivation
• Document clustering groups documents into meaningful clusters that can be used for
  - document collection overview
  - associative browsing
  - basis for multi-document summarisation
  - ...
• In the WWW, documents can be characterized by at least
  - terms contained in the document
  - (external) links from and to the document
• In a WWW setting, the clustering algorithm must be efficient, as datasets are huge
• We use a graph representation and graph clustering
4
Low Frequency Terms
• Documents are more similar the more low frequency terms they share
• For IR, restricting to low frequency terms is not a good idea, but for clustering it is
• The restriction to low frequency terms reduces noise (no stop words) and allows efficient computation of the similarity graph:
for each word do {
    list all pairs of documents containing this word;
}
sort the resulting list of pairs;
for each pair (i, j) in this list, count its number of occurrences as s_ij;
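
A minimal Python sketch of this pair counting, not from the slides; the data layout (a mapping from document id to term set) and the document frequency cutoff max_df are assumptions:

from collections import Counter
from itertools import combinations

def term_similarity_counts(docs, max_df=10):
    # docs: dict mapping document id -> set of terms
    # returns a Counter mapping document pairs (i, j) to s_ij,
    # the number of low frequency terms the two documents share
    postings = {}                        # term -> list of documents containing it
    for doc_id, terms in docs.items():
        for t in terms:
            postings.setdefault(t, []).append(doc_id)
    s = Counter()
    for term, doc_ids in postings.items():
        if len(doc_ids) > max_df:        # skip frequent terms (noise, stop words)
            continue
        for i, j in combinations(sorted(doc_ids), 2):
            s[(i, j)] += 1               # one more shared low frequency term
    return s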
5
Co-occurrence of links
• Web pages are regarded as more similar the more often other pages contain a link to both
• External links are a good source of information, as they are normally set intellectually (by humans)
• Co-occurrence computation is a standard method in NLP and can be performed efficiently, as sketched below
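
The same pair-counting pattern applies to links; a sketch, assuming the crawl is available as a mapping from each page to the set of pages it links to:

from collections import Counter
from itertools import combinations

def link_cooccurrence(outlinks):
    # outlinks: dict mapping page -> set of link targets
    # two pages co-occur once for every third page that links to both
    s = Counter()
    for page, targets in outlinks.items():
        for a, b in combinations(sorted(targets), 2):
            s[(a, b)] += 1
    return s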
6
Graph Representation
• Many datasets are naturally represented as a graph, with nodes encoding entities and edges encoding their relations
• In nature, many graphs possess the small world property and in particular exhibit skewed distributions that are not captured well by vector space models
• Here, documents form nodes and edges indicate statistically extracted relations between them (see the sketch below)
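
A small sketch of this representation: the pair counts s_ij from either source become an undirected similarity graph kept as an adjacency dict. The edge-weight threshold t anticipates the experiments later in the talk; its default value here is an assumption.

def build_graph(pair_counts, t=2):
    # pair_counts: Counter mapping (i, j) -> s_ij
    # returns node -> {neighbour: weight}, keeping only edges with weight >= t
    graph = {}
    for (i, j), w in pair_counts.items():
        if w < t:
            continue
        graph.setdefault(i, {})[j] = w
        graph.setdefault(j, {})[i] = w
    return graph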
7
Dataset: Terms
• Part of the year 2000 German press newswire (dpa)
• 202,086 documents, classified into 309 classes
• The classification is used to measure cluster quality
[Figure: Class size distribution of the classes in the dpa corpus; x-axis: class number ordered by size (0-300), y-axis: number of documents (log scale, 1 to 100,000)]
8
Dataset: Links
• Part of the German Web
• No classification available -> manual evaluation
• Two datasets: servers and URLs

type      # nodes      # edges       # nodes with edges
servers   2,201,421    18,892,068    876,577
URLs      680,239      19,465,650    624,332
[Figure: Cluster size distribution for URLs and servers; log-log plot, x-axis: cluster size (1 to 100,000), y-axis: # of clusters]
9
Chinese Whispers Algorithm
• Nodes have a class and communicate it to their adjacent nodes
• A node adopts one of the majority classes in its neighbourhood
• Nodes are processed in random order for some iterations
Algorithm:
initialize:
    forall vi in V: class(vi) = i;
while changes:
    forall v in V, randomized order:
        class(v) = highest ranked class in neighborhood of v;
[Figure: example update step on a small graph with nodes A-E carrying class labels L1-L4; edge weights and node degrees (deg=1 to deg=5) shown]
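
A runnable Python sketch of the algorithm above, operating on the adjacency-dict graph from slide 6. Ranking classes by summed edge weight and breaking ties at random are assumptions consistent with, but not spelled out on, this slide:

import random

def chinese_whispers(graph, max_iterations=20, seed=None):
    rng = random.Random(seed)
    classes = {v: v for v in graph}        # initialize: class(vi) = i
    nodes = list(graph)
    for _ in range(max_iterations):
        changed = False
        rng.shuffle(nodes)                 # randomized processing order
        for v in nodes:
            weights = {}                   # class -> summed edge weight around v
            for u, w in graph[v].items():
                weights[classes[u]] = weights.get(classes[u], 0) + w
            if not weights:
                continue                   # isolated node keeps its own class
            best = max(weights.values())
            winner = rng.choice([c for c, cw in weights.items() if cw == best])
            if winner != classes[v]:
                classes[v] = winner
                changed = True
        if not changed:                    # may never trigger on tie graphs
            break
    return classes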
10
Example: CW-Partitioning in two steps
11
Properties of CW

PRO:
• Efficiency: CW is time-linear in the number of edges. This is bounded by n² with n = number of nodes, but in real world data, graphs are much sparser
• Parameter-free: this includes the number of clusters

CON:
• Non-deterministic: due to random processing order and possible ties w.r.t. the majority class
• Does not converge: see the tie example

However, the CONs are not severe for real world data...
12
Experiments with Terms
Let D = {d_1, ..., d_q} be the set of documents, G = {G_1, ..., G_m} the gold standard classification and C = {C_1, ..., C_p} the clustering result. Then, the cluster purity CP is calculated as:

CP(C, G) = \frac{1}{p} \sum_{j=1}^{p} \max_{k=1..m} \frac{|C_j \cap G_k|}{|C_j|}
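
A direct Python transcription of this measure, assuming clusters and gold classes are given as lists of document-id sets:

def cluster_purity(clusters, gold):
    # CP(C, G): for each cluster, the fraction of its documents falling
    # into the best-matching gold class, averaged over all clusters
    total = 0.0
    for c in clusters:
        total += max(len(c & g) for g in gold) / len(c)
    return total / len(clusters)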
13
Results on Terms
• In almost every case, CW clustering improves the cluster purity compared to using the connected components as clusters.
• The lower the threshold t, the worse the results are in general, and the larger the improvement, especially when very large components are broken into smaller clusters.
• Very high cluster purity values can be obtained by simply increasing t, but at the cost of reducing coverage significantly. A typical precision/recall trade-off arises (see the sketch below).
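
As a usage illustration, this trade-off can be reproduced by sweeping t with the hypothetical sketches defined on the earlier slides (term_similarity_counts, build_graph, chinese_whispers, cluster_purity):

from collections import defaultdict

def evaluate_thresholds(docs, gold, thresholds=(2, 3, 5, 10)):
    counts = term_similarity_counts(docs)
    for t in thresholds:
        graph = build_graph(counts, t=t)
        classes = chinese_whispers(graph, seed=0)
        clusters = defaultdict(set)          # group documents by assigned class
        for doc_id, c in classes.items():
            clusters[c].add(doc_id)
        purity = cluster_purity(list(clusters.values()), gold)
        coverage = len(graph) / len(docs)    # fraction of documents kept at this t
        print(t, round(purity, 3), round(coverage, 3))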
14
Results on URLs
Examining 20 randomly chosen clusters with a size around 100, the results can be divided into:
• (6) aggressive interlinking on the same server: pharmacy, concert tickets, celebrity pictures (4)
• (5) link farms: servers with different names, but of same origin: a bookstore, gambling, two different pornography farms and a Turkish link farm
• (3) serious portals that contain many intra-server links: a web directory, a news portal, a city portal
• (3) thematic clusters of different origins: Polish hotels, USA golf, Asian hotels
• (2) mixed clusters with several types of sites
• (1) partially same server, partially thematic cluster: hotels and insurances in India
15
Results on Servers
We randomly chose 20 clusters with a size around 100, which can be described as follows:
• (9) thematically related clusters: software, veg(etari)an, Munich technical institutes, porn, city of Ulm, LAN parties, satellite TV, Uni Osnabrück, astronomy
• (6) mixed but dominated by one topic: bloggers, Swiss web design, link farm, motor racing, Uni Mainz, media in Austria
• (2) link farms using different domains
• (3) more or less unrelated clusters
16
Summary
• Efficient methods for constructing similarity graphs of (web) documents
• Experiments show that the similarity measure is useful
• Efficient graph clustering for large datasets
• Methodology to discover link farms
• Examining differences between the similarity sources could give rise to a combined measure
Download an open source Java GUI implementation of Chinese Whispers at
http://wortschatz.informatik.uni-leipzig.de/~cbiemann/software/CW.html
17
Questions?
THANK YOU