
Page 1:

Similarity of Documents and Document Collections using attributes with low noise

Chris Biemann, Uwe Quasthoff
Ifi, NLP Department, University of Leipzig, Germany

Monday 5, 2007
WEBIST'07, Barcelona

Page 2:

Outline

• Motivation

• Attributes with Low Noise
  - Low frequency terms
  - Link similarity

• Chinese Whispers Graph Clustering

• Experimental Results
  - Low frequency terms
  - Link similarity

• Conclusion

Page 3:

Motivation

• Document clustering groups documents into meaningful clusters that can be used for
  - document collection overview
  - associative browsing
  - a basis for multi-document summarisation
  - ...

• In the WWW, documents can be characterized by at least
  - terms contained in the document
  - (external) links from and to the document

• In a WWW setting, the clustering algorithm must be efficient, as datasets are huge

• We use a graph representation and graph clustering

Page 4:

Low Frequency Terms

• Documents are more similar the more low-frequency terms they share

• For IR, this is not a good idea, but for clustering it is.

• Restriction to low-frequency terms reduces noise (no stop words) and allows efficient computation of the similarity graph:

for each word do {
    list all pairs of documents containing this word;
    sort the resulting list of pairs;
}
for each pair (i,j) in this list, count the number of occurrences as s_ij;
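
A minimal Python sketch of this pairing step, assuming an inverted index that maps each low-frequency word to the ids of the documents containing it (names and toy data are illustrative, not from the paper; the counts are accumulated in a dict instead of sorting the pair list, which yields the same s_ij):

    from collections import defaultdict
    from itertools import combinations

    def similarity_counts(index):
        # For every document pair (i, j), count how many
        # low-frequency attributes they share (the weight s_ij).
        s = defaultdict(int)
        for word, docs in index.items():
            # every pair of documents sharing this word gets one count
            for i, j in combinations(sorted(docs), 2):
                s[(i, j)] += 1
        return s

    # Toy index: three documents keyed by the rare terms they contain
    index = {"quasthoff": [1, 2], "whispers": [1, 2, 3], "leipzig": [2, 3]}
    print(dict(similarity_counts(index)))
    # {(1, 2): 2, (1, 3): 1, (2, 3): 2}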

Page 5:

Co-occurrence of Links

• Web pages are regarded as more similar the more often other pages contain a link to both

• External links are a good source of information, as they are normally set up intellectually (i.e., by a human editor)

• Co-occurrence computation is a standard method in NLP and can be performed efficiently
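
Structurally this is the same computation as for shared terms: an external page that links to both d1 and d2 plays the role of a shared rare word, so the similarity_counts sketch above can be reused unchanged (the host names below are made up):

    # Linking page -> pages it links to (the "inverted index" here)
    outlinks = {
        "hub.example":  ["a.example", "b.example"],
        "blog.example": ["a.example", "b.example", "c.example"],
    }
    print(dict(similarity_counts(outlinks)))
    # {('a.example', 'b.example'): 2, ('a.example', 'c.example'): 1,
    #  ('b.example', 'c.example'): 1}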

Page 6:

Graph Representation

• Many datasets are naturally represented as a graph, with nodes encoding entities and edges encoding their relations

• In nature, many graphs possess the small world property; in particular, they exhibit skewed distributions that are not captured well in vector space models

• Here, documents form nodes and edges indicate statistically extracted relations between them
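
The relation extraction above yields weighted document pairs; a small sketch of turning them into such a graph, pruning edges below a minimum shared-attribute threshold t (t echoes the threshold used in the experiments later; this function is an illustrative reconstruction, not the authors' code):

    from collections import defaultdict

    def build_graph(pair_counts, t=2):
        # Keep an edge (i, j) only if documents i and j share at
        # least t low-frequency attributes; weight = shared count.
        adj = defaultdict(dict)
        for (i, j), w in pair_counts.items():
            if w >= t:
                adj[i][j] = w
                adj[j][i] = w
        return adj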

Page 7:

Dataset: Terms

• Part of the year 2000 German press newswire (dpa)
• 202,086 documents, classified into 309 classes
• The classification is used to measure cluster quality

[Figure: class size distribution in the dpa corpus — number of documents (log scale, 1 to 100,000) by class number ordered by size (0 to 300)]

Page 8:

Dataset: Links

• Part of the German Web
• No classification available -> manual evaluation
• Two datasets: servers and URLs

  type     | # nodes   | # edges    | # nodes with edges
  servers  | 2,201,421 | 18,892,068 | 876,577
  URLs     |   680,239 | 19,465,650 | 624,332

[Figure: cluster size distribution (log-log) — number of clusters by cluster size, for URLs and servers]

Page 9:

Chinese Whispers Algorithm

• Nodes have a class and communicate it to their adjacent nodes

• A node adopts the majority class in its neighbourhood (one of them in case of a tie)

• Nodes are processed in random order for some iterations

Algorithm:

initialize:
    forall vi in V: class(vi) = i;
while changes:
    forall v in V, randomized order:
        class(v) = highest ranked class in neighbourhood of v;

[Figure: example neighbourhood — nodes A (L1), D (L2), E (L3), B (L4), C (L3) with degrees 1 to 5; the updating node adopts the class that ranks highest among its neighbours]
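
A compact Python sketch of this loop, assuming a weighted adjacency dict (node -> {neighbour: weight}) like the one built earlier and ranking classes by summed edge weight; ties fall to whichever class max() happens to see first, mirroring the algorithm's non-determinism:

    import random
    from collections import defaultdict

    def chinese_whispers(adj, max_iterations=20, seed=0):
        rng = random.Random(seed)
        label = {v: v for v in adj}         # initialize: class(v_i) = i
        nodes = list(adj)
        for _ in range(max_iterations):
            rng.shuffle(nodes)              # randomized processing order
            changed = False
            for v in nodes:
                score = defaultdict(float)  # class -> summed edge weight
                for u, w in adj[v].items():
                    score[label[u]] += w
                if score:
                    best = max(score, key=score.get)
                    if best != label[v]:
                        label[v] = best
                        changed = True
            if not changed:                 # no relabelling: stop early
                break
        return label

    # Tying the sketches together on the toy data from the earlier slides:
    labels = chinese_whispers(build_graph(similarity_counts(index), t=1))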

Page 10:

Example: CW-Partitioning in two steps

Page 11:

Properties of CW

PRO:
• Efficiency: CW is time-linear in the number of edges. This is bounded by n² with n = number of nodes, but real-world graphs are much sparser
• Parameter-free: this includes the number of clusters

CON:
• Non-deterministic: due to randomized processing order and possible ties w.r.t. the majority class
• Does not converge: see the tie example

However, the CONs are not severe for real-world data...

Page 12:

Experiments with Terms

Let D = {d1, ..., dq} be the set of documents, G = {G1, ..., Gm} the gold standard classification, and C = {C1, ..., Cp} the clustering result. Then, the cluster purity CP is calculated as:

CP(C, G) = \frac{1}{\sum_{j=1}^{p} |C_j|} \sum_{i=1}^{p} \max_{k=1..m} |C_i \cap G_k|
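
A worked sketch of this measure, with clusters and gold classes as sets of document ids (the toy data is illustrative):

    def cluster_purity(clusters, gold):
        # CP = (1 / sum_j |C_j|) * sum_i max_k |C_i ∩ G_k|
        total = sum(len(c) for c in clusters)
        best = sum(max(len(c & g) for g in gold) for c in clusters)
        return best / total

    clusters = [{1, 2, 3}, {4, 5}]
    gold     = [{1, 2}, {3, 4, 5}]
    print(cluster_purity(clusters, gold))   # (2 + 2) / 5 = 0.8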

Page 13:

Results on Terms

• In almost every case, CW clustering improves the cluster purity compared to the plain connected components.

• The lower the threshold t, the worse the results in general, and the larger the improvement, especially when breaking very large components into smaller clusters.

• It is possible to obtain very high cluster purity values by simply increasing t, but at the cost of reducing coverage significantly. A typical precision/recall trade-off arises.

Page 14:

Results on URLs

Examining 20 randomly chosen clusters with a size around 100, the results can be divided into:

• (6) aggressive interlinking on the same server: pharmacy, concert tickets, celebrity pictures (4 clusters)

• (5) link farms: servers with different names but of the same origin: a bookstore, gambling, two different pornography farms, and a Turkish link farm

• (3) serious portals that contain many intra-server links: a web directory, a news portal, a city portal

• (3) thematic clusters of different origins: Polish hotels, USA golf, Asian hotels

• (2) mixed clusters with several types of sites

• (1) partially same server, partially thematic cluster: hotels and insurances in India

Page 15:

Results on Servers

We randomly chose 20 clusters with a size around 100, which can be described as follows:

• (9) thematically related clusters: software, veg(etari)an, Munich technical institutes, porn, city of Ulm, LAN parties, satellite TV, Uni Osnabrück, astronomy

• (6) mixed but dominated by one topic: bloggers, Swiss web design, link farm, motor racing, Uni Mainz, media in Austria

• (2) link farms using different domains

• (3) more or less unrelated clusters

Page 16:

Summary

• Efficient methods for constructing similarity graphs of (web) documents

• Experiments show that the similarity measure is useful

• Efficient graph clustering for large datasets

• A methodology to discover link farms

• Examining differences between the two similarity sources could give rise to a combined measure

Download an open-source Java GUI implementation of Chinese Whispers at

http://wortschatz.informatik.uni-leipzig.de/~cbiemann/software/CW.html

Page 17:

Questions?

THANK YOU