
Page 1:

Similarity of Documents and Document Collections using attributes with low noise

Chris Biemann, Uwe Quasthoff
Ifi, NLP Department, University of Leipzig, Germany

Monday 5, 2007
WEBIST'07, Barcelona

Page 2:

Outline

• Motivation

• Attributes with Low Noise
  - Low frequency terms
  - Link similarity

• Chinese Whispers Graph Clustering

• Experimental Results
  - Low frequency terms
  - Link similarity

• Conclusion

Page 3:

Motivation

• Document clustering groups documents into meaningful clusters that can be used for
  - document collection overview
  - associative browsing
  - a basis for multi-document summarisation
  - ...

• In the WWW, documents can be characterized by at least
  - terms contained in the document
  - (external) links from and to the document

• In a WWW setting, the clustering algorithm must be efficient, as datasets are huge

• We use a graph representation and graph clustering

Page 4:

Low Frequency Terms

• Documents are more similar the more low-frequency terms they share

• For IR, this is not a good idea, but for clustering it is.

• Restriction to low-frequency terms reduces noise (no stop words) and allows efficient computation of the similarity graph:

for each word do {
    list all pairs of documents containing this word;
    sort the resulting list of pairs;
}
for each pair (i,j) in this list, count the number of occurrences as s_ij;
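
A minimal Python sketch of this pairing step, assuming an inverted index that maps each low-frequency word to the ids of the documents containing it (names and toy data are illustrative, not from the paper; the counts are accumulated in a dict instead of sorting the pair list, which yields the same s_ij):

    from collections import defaultdict
    from itertools import combinations

    def similarity_counts(index):
        # For every document pair (i, j), count how many
        # low-frequency attributes they share (the weight s_ij).
        s = defaultdict(int)
        for word, docs in index.items():
            # every pair of documents sharing this word gets one count
            for i, j in combinations(sorted(docs), 2):
                s[(i, j)] += 1
        return s

    # Toy index: three documents keyed by the rare terms they contain
    index = {"quasthoff": [1, 2], "whispers": [1, 2, 3], "leipzig": [2, 3]}
    print(dict(similarity_counts(index)))
    # {(1, 2): 2, (1, 3): 1, (2, 3): 2}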

Page 5:

Co-occurrence of Links

• Web pages are regarded as more similar the more often other pages contain a link to both

• External links are a good source of information, as they are normally set up intellectually (i.e., by a human editor)

• Co-occurrence computation is a standard method in NLP and can be performed efficiently
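
Structurally this is the same computation as for shared terms: an external page that links to both d1 and d2 plays the role of a shared rare word, so the similarity_counts sketch above can be reused unchanged (the host names below are made up):

    # Linking page -> pages it links to (the "inverted index" here)
    outlinks = {
        "hub.example":  ["a.example", "b.example"],
        "blog.example": ["a.example", "b.example", "c.example"],
    }
    print(dict(similarity_counts(outlinks)))
    # {('a.example', 'b.example'): 2, ('a.example', 'c.example'): 1,
    #  ('b.example', 'c.example'): 1}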

Page 6:

Graph Representation

• Many datasets are naturally represented as a graph, with nodes encoding entities and edges encoding their relations

• In nature, many graphs possess the small world property; in particular, they exhibit skewed distributions that are not captured well in vector space models

• Here, documents form nodes and edges indicate statistically extracted relations between them
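
The relation extraction above yields weighted document pairs; a small sketch of turning them into such a graph, pruning edges below a minimum shared-attribute threshold t (t echoes the threshold used in the experiments later; this function is an illustrative reconstruction, not the authors' code):

    from collections import defaultdict

    def build_graph(pair_counts, t=2):
        # Keep an edge (i, j) only if documents i and j share at
        # least t low-frequency attributes; weight = shared count.
        adj = defaultdict(dict)
        for (i, j), w in pair_counts.items():
            if w >= t:
                adj[i][j] = w
                adj[j][i] = w
        return adj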

Page 7:

Dataset: Terms

• Part of the year 2000 German press newswire (dpa)
• 202,086 documents, classified into 309 classes
• The classification is used to measure cluster quality

[Figure: class size distribution in the dpa corpus — number of documents (log scale, 1 to 100,000) by class number ordered by size (0 to 300)]

Page 8:

Dataset: Links

• Part of the German Web
• No classification available -> manual evaluation
• Two datasets: servers and URLs

  type     | # nodes   | # edges    | # nodes with edges
  servers  | 2,201,421 | 18,892,068 | 876,577
  URLs     |   680,239 | 19,465,650 | 624,332

[Figure: cluster size distribution (log-log) — number of clusters by cluster size, for URLs and servers]

Page 9:

Chinese Whispers Algorithm

• Nodes have a class and communicate it to their adjacent nodes

• A node adopts the majority class in its neighbourhood (one of them in case of a tie)

• Nodes are processed in random order for some iterations

Algorithm:

initialize:
    forall vi in V: class(vi) = i;
while changes:
    forall v in V, randomized order:
        class(v) = highest ranked class in neighbourhood of v;

[Figure: example neighbourhood — nodes A (L1), D (L2), E (L3), B (L4), C (L3) with degrees 1 to 5; the updating node adopts the class that ranks highest among its neighbours]
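
A compact Python sketch of this loop, assuming a weighted adjacency dict (node -> {neighbour: weight}) like the one built earlier and ranking classes by summed edge weight; ties fall to whichever class max() happens to see first, mirroring the algorithm's non-determinism:

    import random
    from collections import defaultdict

    def chinese_whispers(adj, max_iterations=20, seed=0):
        rng = random.Random(seed)
        label = {v: v for v in adj}         # initialize: class(v_i) = i
        nodes = list(adj)
        for _ in range(max_iterations):
            rng.shuffle(nodes)              # randomized processing order
            changed = False
            for v in nodes:
                score = defaultdict(float)  # class -> summed edge weight
                for u, w in adj[v].items():
                    score[label[u]] += w
                if score:
                    best = max(score, key=score.get)
                    if best != label[v]:
                        label[v] = best
                        changed = True
            if not changed:                 # no relabelling: stop early
                break
        return label

    # Tying the sketches together on the toy data from the earlier slides:
    labels = chinese_whispers(build_graph(similarity_counts(index), t=1))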

Page 10:

Example: CW-Partitioning in two steps

Page 11:

Properties of CW

PRO:
• Efficiency: CW is time-linear in the number of edges. This is bounded by n² with n = number of nodes, but real-world graphs are much sparser
• Parameter-free: this includes the number of clusters

CON:
• Non-deterministic: due to randomized processing order and possible ties w.r.t. the majority class
• Does not converge: see the tie example

However, the CONs are not severe for real-world data...

Page 12:

Experiments with Terms

Let D = {d1, ..., dq} be the set of documents, G = {G1, ..., Gm} the gold standard classification, and C = {C1, ..., Cp} the clustering result. Then, the cluster purity CP is calculated as:

CP(C, G) = \frac{1}{\sum_{j=1}^{p} |C_j|} \sum_{i=1}^{p} \max_{k=1..m} |C_i \cap G_k|
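
A worked sketch of this measure, with clusters and gold classes as sets of document ids (the toy data is illustrative):

    def cluster_purity(clusters, gold):
        # CP = (1 / sum_j |C_j|) * sum_i max_k |C_i ∩ G_k|
        total = sum(len(c) for c in clusters)
        best = sum(max(len(c & g) for g in gold) for c in clusters)
        return best / total

    clusters = [{1, 2, 3}, {4, 5}]
    gold     = [{1, 2}, {3, 4, 5}]
    print(cluster_purity(clusters, gold))   # (2 + 2) / 5 = 0.8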

Page 13:

Results on Terms

• In almost every case, CW clustering improves the cluster purity compared to the plain connected components.

• The lower the threshold t, the worse the results in general, and the larger the improvement, especially when breaking very large components into smaller clusters.

• It is possible to obtain very high cluster purity values by simply increasing t, but at the cost of reducing coverage significantly. A typical precision/recall trade-off arises.

Page 14:

Results on URLs

Examining 20 randomly chosen clusters with a size around 100, the results can be divided into:

• (6) aggressive interlinking on the same server: pharmacy, concert tickets, celebrity pictures (4 clusters)

• (5) link farms: servers with different names but of the same origin: a bookstore, gambling, two different pornography farms, and a Turkish link farm

• (3) serious portals that contain many intra-server links: a web directory, a news portal, a city portal

• (3) thematic clusters of different origins: Polish hotels, USA golf, Asian hotels

• (2) mixed clusters with several types of sites

• (1) partially same server, partially thematic cluster: hotels and insurances in India

Page 15:

Results on Servers

We randomly chose 20 clusters with a size around 100, which can be described as follows:

• (9) thematically related clusters: software, veg(etari)an, Munich technical institutes, porn, city of Ulm, LAN parties, satellite TV, Uni Osnabrück, astronomy

• (6) mixed but dominated by one topic: bloggers, Swiss web design, link farm, motor racing, Uni Mainz, media in Austria

• (2) link farms using different domains

• (3) more or less unrelated clusters

Page 16:

Summary

• Efficient methods for constructing similarity graphs of (web) documents

• Experiments show that the similarity measure is useful

• Efficient graph clustering for large datasets

• A methodology to discover link farms

• Examining differences between the two similarity sources could give rise to a combined measure

Download an open-source Java GUI implementation of Chinese Whispers at

http://wortschatz.informatik.uni-leipzig.de/~cbiemann/software/CW.html

Page 17:

Questions?

THANK YOU